protonish/cipherdaug-nmt

The official code for our ACL 2022 long paper CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation.

Data Prep, Ciphertexts and more

All example scripts are based on the IWSLT14 De→En setup. All the bash scripts are sufficiently annotated for reference.

You can find the data in the data directory. Run the following to unpack the original source-target parallel data.

tar -xvzf de-en.tar.gz

The compressed tar file dex-en.tar.gz contains the ciphertext files from the experiments for reference. Rather than directly unpacking it, follow the procedure below to reproduce/recreate the data.

Generating ciphertexts from plaintext

The simplest way to generate ciphertexts from any input text with the Python script is:

python cipher/encipher.py -i path/to/input-file --keys key-value \
    --char-dict-path path/to/store/char-dictionary > output/path/and/filename
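
For example, a single invocation that enciphers the German training split with key 3 might look like this (a minimal sketch; the paths and file names are illustrative, not fixed by the script):

python cipher/encipher.py -i data/de-en/train.de --keys 3 \
    --char-dict-path data/de-en/char_dict.txt > data/de-en/train.de3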

For the intended usage and for generating consistently named data files, use the bash script encipher.sh. The script is set up for IWSLT14 De-En but can easily be adapted to other language pairs.

bash encipher.sh -s de -t en -x src

The -x flag denotes which side (src/tgt) to encipher; we recommend using -x src only. Note that -x tgt has been removed since the initial phases of the project.

Inside the bash file, you can set the exact keys (the k values for ROT-k) and splits (train/valid/test) you want. These keys form the filename suffixes of the enciphered versions of the source.

KEYS=(1 2 3 4 5) # filenames as {key: suffix} dict := {1: de1, 2: de2, 3: de3, N: deN} etc.
SPLITS=("train" "valid" "test")

Note: This script will produce enciphered versions of the relevant data in a directory named dex-en (for args -s de -t en -x src).
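
Conceptually, the enciphering loop inside encipher.sh amounts to something like the sketch below (simplified, assuming the De-En source-side setup above; the real script handles paths and naming more carefully):

for split in "${SPLITS[@]}"; do
    for key in "${KEYS[@]}"; do
        # e.g. produces dex-en/train.de1, dex-en/train.de2, ...
        python cipher/encipher.py -i data/de-en/${split}.de --keys ${key} \
            --char-dict-path dex-en/char_dict.txt > dex-en/${split}.de${key}
    done
done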

Preprocessing

Note that our preprocessing differs slightly from the standard preprocessing in the fairseq example: (1) we use sentencepiece instead of subword-nmt, and (2) we do NOT tokenize the data with moses (we do 'clean' it with moses, though), as moses tokenization interferes with generating ciphertexts.

To create the parallel data and to learn and apply BPEs on all relevant files at once, use multi_preprocessing.sh:

# bash multi_preprocessing.sh [src] [srcx-tgt]
bash multi_preprocessing.sh de dex-en
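
Roughly speaking, this step trains one joint sentencepiece BPE model over the plaintext and ciphertext sources plus the target, and then applies it to every split. A minimal sketch with the sentencepiece command-line tools (file names and vocab size are illustrative; this is not the actual content of multi_preprocessing.sh):

# learn one joint BPE model over all sides
cat dex-en/train.de dex-en/train.de1 dex-en/train.de2 dex-en/train.en > dex-en/spm_input.txt
spm_train --input=dex-en/spm_input.txt --model_prefix=dex-en/spm_bpe \
    --model_type=bpe --vocab_size=10000 --character_coverage=1.0

# apply it to each split and side, e.g. the German training data
spm_encode --model=dex-en/spm_bpe.model < dex-en/train.de > dex-en/train.bpe.de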

Then use multi_binarize.sh to generate a joint multilingual dictionary and the binary files for fairseq to use:

bash multi_binarize.sh
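
For orientation, binarizing with a shared dictionary in stock fairseq looks roughly like the sketch below; multi_binarize.sh automates this over all the plaintext/ciphertext pairs (paths and language codes are illustrative):

# build one shared dictionary over all BPE'd training text
cat dex-en/train.bpe.de dex-en/train.bpe.de1 dex-en/train.bpe.de2 dex-en/train.bpe.en > dex-en/train.bpe.all
fairseq-preprocess --only-source --trainpref dex-en/train.bpe.all \
    --destdir data-bin/dex-en --workers 8

# binarize each pair with that shared dictionary (repeat for de1-en, de2-en, ...)
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref dex-en/train.bpe --validpref dex-en/valid.bpe --testpref dex-en/test.bpe \
    --srcdict data-bin/dex-en/dict.txt --tgtdict data-bin/dex-en/dict.txt \
    --destdir data-bin/dex-en --workers 8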

Training and Evaluation based on FairSeq-CipherDAug

Our adaptation of FairSeq is crucial for this codebase to work. More details on the changes are given here.

Example training script

train_cipherdaug.sh comes loaded with all the relevant details; set your hyperparameters there and start training:

bash train_cipherdaug.sh
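
For orientation only: a plain IWSLT14 De-En transformer in stock fairseq would be launched roughly as below. The actual train_cipherdaug.sh additionally configures the CipherDAug-specific task and criterion from our FairSeq adaptation, which this sketch deliberately leaves out:

fairseq-train data-bin/dex-en \
    --arch transformer_iwslt_de_en --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.0001 \
    --max-tokens 4096 --save-dir checkpoints/cipherdaug-de-en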

Example evaluation script

This script generates translations and calculates both multibleu and sacreBLEU scores.

# bash gen_and_bleu.sh [split] [src] [tgt]
# split : train/valid/test
# src : de/de1/de2 ; tgt : en

bash gen_and_bleu.sh test de en
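
Under the hood, this kind of evaluation boils down to fairseq-generate followed by sacreBLEU. A minimal sketch under the assumptions above (illustrative paths; not the actual gen_and_bleu.sh):

fairseq-generate data-bin/dex-en --source-lang de --target-lang en \
    --path checkpoints/cipherdaug-de-en/checkpoint_best.pt \
    --gen-subset test --beam 5 --remove-bpe sentencepiece > gen.out

# pull out the hypotheses in dataset order and score them against the reference
grep ^H gen.out | sort -V | cut -f3- > hyp.txt
sacrebleu data/de-en/test.en < hyp.txt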

Cite

Please consider citing us if you find any part of our code or work useful:

@inproceedings{kambhatla-etal-2022-cipherdaug,
    abbr = "ACL",
    title = "CipherDAug: Ciphertext Based Data Augmentation for Neural Machine Translation",
    author = "Kambhatla, Nishant and
      Born, Logan and
      Sarkar, Anoop",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Long Paper (To Appear)",
    month = may,
    year = "2022",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}
