Character-Level Translation with Self-attention

Code for the paper Character-Level Translation with Self-attention, accepted at ACL 2020.

Corpora and experiments

We test our model on two corpora:

Preparations

Download our code and install our version of Fairseq

We use fairseq (Ott et al., 2019) as the base for implementing our model. To install our fairseq snapshot, run the following commands:

git clone https://github.com/CharizardAcademy/convtransformer.git
cd convtransformer/
pip install -r requirements.txt # install dependencies
python setup.py build # build fairseq
python setup.py develop

To make fairseq work on the character level, we modify tokenizer.py in our fairseq snapshot.
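A minimal sketch of the idea behind that change (not the exact code in tokenizer.py): instead of splitting a normalized line into whitespace-separated words, we emit one token per character and keep the space as an explicit word-boundary token.

import re

SPACE_NORMALIZER = re.compile(r"\s+")

def tokenize_line_char(line):
    # Collapse runs of whitespace, then return one token per character;
    # the space itself stays in the sequence as a word-boundary marker.
    line = SPACE_NORMALIZER.sub(" ", line).strip()
    return list(line)

print(tokenize_line_char("char level translation"))
# ['c', 'h', 'a', 'r', ' ', 'l', 'e', 'v', 'e', 'l', ' ', 't', ...]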

Data preprocessing

We use Moses (Koehn et al., 2007) to clean and tokenize the data by applying the following scripts (a sketch of how they can be chained follows the list):

mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl
mosesdecoder/scripts/tokenizer/tokenizer.perl
mosesdecoder/scripts/training/clean-corpus-n.perl
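As an illustration, the first two scripts can be chained from Python as below; the file names, language codes, and the sentence-length limits passed to clean-corpus-n.perl are examples rather than our exact settings.

import subprocess

MOSES = "mosesdecoder/scripts"  # path to the Moses scripts (example)

def clean_and_tokenize(raw_path, tok_path, lang):
    # Strip non-printing characters, then tokenize with the Moses tokenizer.
    with open(raw_path, "rb") as src, open(tok_path, "wb") as dst:
        strip = subprocess.Popen(
            ["perl", MOSES + "/tokenizer/remove-non-printing-char.perl"],
            stdin=src, stdout=subprocess.PIPE)
        subprocess.run(
            ["perl", MOSES + "/tokenizer/tokenizer.perl", "-l", lang],
            stdin=strip.stdout, stdout=dst, check=True)
        strip.stdout.close()
        strip.wait()

clean_and_tokenize("UNv1.0.en-fr.fr", "tok.UNv1.0.en-fr.fr", "fr")
clean_and_tokenize("UNv1.0.en-fr.en", "tok.UNv1.0.en-fr.en", "en")

# clean-corpus-n.perl then filters the tokenized pair jointly, e.g.
# perl mosesdecoder/scripts/training/clean-corpus-n.perl tok.UNv1.0.en-fr fr en clean.UNv1.0.en-fr 1 250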

Converting Chinese texts to Wubi texts

To convert a text of raw Chinese characters into a text of corresponding Wubi codes, run the following commands:

cd convtransformer/

python convert_text.py --input-doc path/to/the/chinese/text --output-doc path/to/the/wubi/text --convert-type ch2wb

The convert_text.py script is available at https://github.com/duguyue100/wmt-en2wubi.
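Conceptually, the conversion is a per-character lookup from Chinese characters to their Wubi codes. A minimal sketch of the idea follows; the table entries are placeholders for illustration, and the real script also handles code boundaries and vocabulary coverage.

# CH2WB maps each Chinese character to a Wubi code; the two entries below are
# placeholders, not the actual mapping used by convert_text.py.
CH2WB = {
    "中": "khk",
    "国": "lgyi",
}

def ch2wb(line):
    # Characters without an entry (Latin letters, digits, punctuation, spaces)
    # are passed through unchanged.
    return "".join(CH2WB.get(ch, ch) for ch in line)

print(ch2wb("中国 UNPC"))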

Bilingual Training Data

To construct training sets for bilingual translation, run the following commands (example for UNPC French - English):

cd UN-corpora/
cd ./en-fr

paste -d'|' UNv1.0.en-fr.fr UNv1.0.en-fr.en | cat -n | shuf -n 1000000 | sort -n | cut -f2 > train.parallel.fr-en


cut -d'|' -f1 train.parallel.fr-en > 1mil.train.fr-en.fr
cut -d'|' -f2 train.parallel.fr-en > 1mil.train.fr-en.en

Multilingual Training Data

To construct training sets for multilingual translation, run the following commands (example for UNPC French + Spanish - English):

cat train.parallel.fr-en train.parallel.es-en > concat.train.parallel.fres-en

shuf concat.train.parallel.fres-en > shuffled.train.parallel.fres-en

cut -d'|' -f1 shuffled.train.parallel.fres-en > 2mil.train.fres-en.fres
cut -d'|' -f2 shuffled.train.parallel.fres-en > 2mil.train.fres-en.en

Data Binarization

The next step is to binarize the data. Example for UNPC French + Spanish - English:

mkdir -p UN-bin/multilingual/fres-en/test-fr/
mkdir -p UN-bin/multilingual/fres-en/test-es/

cd convtransformer/

# evaluation on French input

python preprocess.py --source-lang fres --target-lang en \
--trainpref UN-processed/multilingual/fres-en/test-fr/2mil.train.fres-en \
--validpref UN-processed/multilingual/fres-en/test-fr/2mil.valid.fres-en \
--testpref UN-processed/multilingual/fres-en/test-fr/2mil.test.fres-en.fr \
--destdir UN-bin/multilingual/fres-en/test-fr/ \
--nwordssrc 10000 --nwordstgt 10000

# evaluation on Spanish input

python preprocess.py --source-lang fres --target-lang en \
--trainpref UN-processed/multilingual/fres-en/test-es/2mil.train.fres-en \
--validpref UN-processed/multilingual/fres-en/test-es/2mil.valid.fres-en \
--testpref UN-processed/multilingual/fres-en/test-es/2mil.test.fres-en.es \
--destdir UN-bin/multilingual/fres-en/test-es/ \
--nwordssrc 10000 --nwordstgt 10000

Convtransformer model

The convtransformer model is implemented in our fairseq snapshot and registered under the architecture name convtransformer (used below as --arch convtransformer).

Training

We train our models on 4 NVIDIA 1080x GPUs, using Adam:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py UN-bin/multilingual/fres-en/test-es/ \
--arch convtransformer --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0001 \
--min-lr 1e-09 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --max-tokens 3000  \
--save-dir checkpoints-conv-multi-fres-en/ \
--no-progress-bar --log-format simple --log-interval 2000 \
--find-unused-parameters --ddp-backend=no_c10d

where --ddp-backend=no_c10d and --find-unused-parameters are crucial arguments to train the convtransformer model. You should change CUDA_VISIBLE_DEVICES according to the hardware you have available.

Inference

We compute BLEU using the multi-bleu.perl script from Moses.

Evaluation on test set

As an example, to evaluate the test set, run conv-multi-fres-en.sh to generate a translation file for each individual checkpoint (a rough sketch of such a loop is given after the commands below). To compute the BLEU score of one translation file, run:

cd generations/conv-multi-fres-en/
cd ./test-fr/

bash generation_split.sh

rm -f generation_split.sh.sys generation_split.sh.ref 

mkdir split

mv generate*.out.sys ./split/
mv generate*.out.ref ./split/

cd ./split/

perl multi-bleu.perl generate30.out.ref < generate30.out.sys
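For reference, the per-checkpoint generation step performed by conv-multi-fres-en.sh can be pictured as the loop below; the checkpoint paths, output names, and generate.py flags are assumptions rather than the exact contents of that script.

import glob
import subprocess

DATA_BIN = "UN-bin/multilingual/fres-en/test-fr/"
OUT_DIR = "generations/conv-multi-fres-en/test-fr/"

for ckpt in sorted(glob.glob("checkpoints-conv-multi-fres-en/checkpoint*.pt")):
    # e.g. "checkpoints-conv-multi-fres-en/checkpoint30.pt" -> "30"
    step = ckpt.rsplit("checkpoint", 1)[-1].replace(".pt", "")
    with open(OUT_DIR + "generate" + step + ".out", "w") as out:
        subprocess.run(
            ["python", "generate.py", DATA_BIN,
             "--path", ckpt, "--beam", "5", "--batch-size", "32"],
            stdout=out, check=True)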

Evaluation with manual input

To generate a translation for a manually entered sentence, run:

cd convtransformer/

python interactive.py -source_sentence "Violación: uso de cloro gaseoso por el régimen sirio." \
-path_checkpoint "checkpoints-conv-multi-fres-en/checkpoint30.pt" \
-data_bin "UN-bin/multilingual/fres-en/test-es/"

This will print out the translated sentence in the terminal.

Analysis

Canonical Correlation Analysis

We compute the correlation coefficients with the CCA algorithm, using the encoder-decoder attention matrices from the sixth (last) model layer.

As an example, to obtain the attention matrices, run:

cd convtransformer/ 

bash attn_matrix.sh

To compute the correlation coefficients, run:

python cca.py -path_X "/bilingual/attention/matrix/" -path_Y "/multilingual/attention/matrix/"
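A minimal sketch of the CCA computation itself, assuming the saved attention matrices have been loaded into two arrays of shape (n_samples, n_features); cca.py may differ in details such as loading, averaging over attention heads, or regularization.

import numpy as np

def canonical_correlations(X, Y):
    # Center both views, take orthonormal bases of their column spaces, and
    # read off the canonical correlations as the singular values of Ux^T Uy.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    return np.clip(np.linalg.svd(Ux.T @ Uy, compute_uv=False), 0.0, 1.0)

# Random stand-ins for the bilingual and multilingual attention matrices:
X = np.random.randn(200, 64)
Y = np.random.randn(200, 64)
print(canonical_correlations(X, Y).mean())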

Citation

@inproceedings{gao2020character,
  title={Character-level {T}ranslation with {S}elf-attention},
  author={Yingqiang Gao and Nikola I. Nikolov and Yuhuang Hu and Richard H.R. Hahnloser},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  publisher = "Association for Computational Linguistics",
  year={2020}
}
