monisha-jega/paraphrasing_embedding_outputs

DATA (data/)

kumarvon2018-data/fren/

train.tok.true.en, train.tok.true.fr
Preprocessed IWSLT'16 train files

tst201314.tok.true.en, tst201314.tok.true.fr
Preprocessed IWSLT'16 validation files

tst201516.tok.true.en, tst201516.tok.true.fr
Preprocessed IWSLT'16 test files

All other files are intermediates.

gen/1/

train.src, train.lang, train.tgt
Training data consisting of 1% autoencoding data, constructed from the train.tok.true.* files above (see the sketch at the end of this subsection).

train.shuffled.src, train.shuffled.lang, train.shuffled.tgt
Same as above, but shuffled.

trans/

train.shuffled.src, train.shuffled.tgt
Same as above, but with a language token prepended - ready to be fed to the transformer.
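
The construction of these files is simple to sketch: two-way translation pairs from the parallel data, plus a small identity (autoencoding) sample. A minimal illustration with a hypothetical helper name; the actual construction is done by the data-creation scripts under scripts/:

```python
import random

def build_mixed_data(en_lines, fr_lines, ae_frac=0.01, ae_lang="en", seed=0):
    """Two-way translation pairs plus a small fraction of autoencoding pairs.

    Returns parallel (src, lang, tgt) lists; `lang` holds the target-language
    descriptor for each pair. A hypothetical sketch, not the repo's code.
    """
    src, lang, tgt = [], [], []
    # Translation data in both directions.
    for en, fr in zip(en_lines, fr_lines):
        src += [en, fr]
        lang += ["fr", "en"]  # target language of each pair
        tgt += [fr, en]
    # Autoencoding data: a small sample with identical source and target.
    rng = random.Random(seed)
    mono = en_lines if ae_lang == "en" else fr_lines
    for line in rng.sample(mono, max(1, round(ae_frac * len(mono)))):
        src.append(line)
        lang.append(ae_lang)
        tgt.append(line)
    return src, lang, tgt
```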

gen/1fr/

train.src, train.lang, train.tgt
Training data consisting of 1% autoencoding data, constructed from the train.tok.true.* files above, but the autoencoding data is French rather than English.

train.shuffled.src, train.shuffled.lang, train.shuffled.tgt
Same as above, but shuffled.

trans/

train.shuffled.src, train.shuffled.tgt
Same as above, but with a language token prepended - ready to be fed to the transformer.

gen/5/

train.src, train.lang, train.tgt
Training data consisting of 5% autoencoding data, constructed from the train.tok.true.* files above.

train.shuffled.src, train.shuffled.lang, train.shuffled.tgt
Same as above, but shuffled.

trans/

train.shuffled.src, train.shuffled.tgt
Same as above, but with a language token prepended - ready to be fed to the transformer.

gen/trans/

en_val.en, fr_val.fr
Same as the tst201314.tok.true.* files above, but with English/French start tokens prepended - to be fed as validation data to transformer models for paraphrasing.

fr_test.en, en_test.fr
Same as the tst201516.tok.true.* files above, but with English/French start tokens prepended - to be fed as test data to transformer models for translation.

en_test.en, fr_test.fr
Same as the tst201516.tok.true.* files above, but with English/French start tokens prepended - to be fed as test data to transformer models for paraphrasing.

gen/twoway/

train.src, train.lang, train.tgt
Training data consisting of 0% autoencoding data (two-way translation only), constructed from the train.tok.true.* files above.

train.shuffled.src, train.shuffled.lang, train.shuffled.tgt
Same as above, but shuffled.

trans/

train.shuffled.src, train.shuffled.tgt
Same as above, but with a language token prepended - ready to be fed to the transformer.

ruen_unused/

train.ru Monolingual Russian corpus

train.tags.en-ru.en, train.tags.en-ru.ru Training parallel data in XML format.

IWSLT14.TED.dev2010.en-ru.en.xml, IWSLT14.TED.dev2010.en-ru.ru.xml Dev set of parallel data in XML format.

IWSLT14.TED.test201*.en-ru.en.xml, IWSLT14.TED.test201*.en-ru.ru.xml Test sets of parallel data in XML format.

make_train_data.py
Create train data by combining all the parallel data.

full_train.en, full_train.ru The output of the above code, complete parallel data.

ru-en_sachin/

split/ The actual data provided by Sachin, in two parts, which I combined.

train.en, train.ru Training data

train.en.tok.true.en, train.ru.tok.true.ru Training data, tokenized and truecased.

paranmt/

filtered_paranmt_0.7_0.5
The complete dataset provided by John

sample_data.py
Create train, test and validation data from the above dataset.

train.src, train.tgt
Training set sampled from the above
train.tok.true.en, train.tok.true.en
Preprocessed train files

val.src, val.tgt
Validation set sampled from the above
val.tok.true.en, val.tok.true.en
Preprocessed validation files

test.src, test.tgt
Test set sampled from the above
test.tok.true.en, test.tok.true.en
Preprocessed test files

onepara/

train.src, train.lang, train.tgt
Training data consisting of 1% autoencoding data, where the autoencoding data is from ParaNMT, whereas the translation data is from IWSLT.

trans/

train.src, train.lang, train.tgt
Same as above, but shuffled and with a language token prepended - ready to be fed to the transformer.

en-fr-paraphrase-john/

en.txt, fr.txt
The complete dataset provided by John

sample_data.py
Create train, test and validation data from the above dataset.

eng-train.txt, fren-train.txt
Training set sampled from the above
eng-train.tok.true.en, fren-train.tok.true.en
Preprocessed train files

eng-val.txt, fren-val.txt
Validation set sampled from the above
eng-val.tok.true.en, fren-val.tok.true.en
Preprocessed validation files

eng-test.txt, fren-test.txt
Test set sampled from the above
eng-test.tok.true.en, fren-test.tok.true.en
Preprocessed test files

1/

train.src, train.lang, train.tgt
Training data consisting of 1% autoencoding data, constructed from the *-train.txt files above.

train.shuffled.src, train.shuffled.lang, train.shuffled.tgt
Same as above, but shuffled.

trans/

train.shuffled.src, train.shuffled.tgt
Same as above, but with a language token prepended - ready to be fed to the transformer.

All other files are intermediates.

johnpara/

train.src, train.lang, train.tgt
Training data consisting of 4,000 autoencoding data points, where the autoencoding data is from ParaNMT, whereas the translation data is from en-fr-paraphrase-john.

trans/

train.src, train.lang, train.tgt
Same as above, but shuffled and with a language token prepended - ready to be fed to the transformer.



orig_seq2seq-con-trans/

Sachin's original translation code using the OpenNMT-py transformer. The actual transformer code is inside OpenNMT-py/.



seq2seq-con-trans/

Copy of the above folder, but modified to accept language-specific start tokens while decoding.



seq2seq-con-trans-mv/

Copy of the above folder, but modified to handle a maximum of three validation sets instead of one.



UTILITY CODE (util-code/)

add_noise
Adds noise to data by shuffling words within a given window size and by dropping out some words.
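
A minimal sketch of this kind of noising (word dropout plus a window-limited shuffle), with hypothetical parameter names; the actual add_noise code may differ in detail:

```python
import random

def add_noise(tokens, window=3, drop_prob=0.1, rng=random):
    """Drop some words, then shuffle each word at most `window` positions."""
    if not tokens:
        return tokens
    # Word dropout: remove each token with probability drop_prob,
    # keeping at least one token.
    kept = [t for t in tokens if rng.random() > drop_prob] or [rng.choice(tokens)]
    # Window-limited shuffle: sort positions perturbed by a random offset
    # in [0, window + 1), so no word moves more than `window` places.
    keys = [i + rng.uniform(0, window + 1) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]
```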

create_test_file
Create a test file out of a large dataset.

vecmap/
The cross-lingual embedding vector mapping tool that was used to map the French and English fasttext embeddings into the same space, using a seed dictionary.

moses/
The Moses preprocessing library, modified a little to handle UTF-8.

bert-score/
The BERTScore python library, modified a little to handle UTF-8. Although it can be installed from pip, the modified version is needed in some cases.

combine_emb_files.py
Combines two embedding files into one - used to combine the French and English embeddings for bilingual models.
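
If the embedding files are in the standard word2vec text format (a "<count> <dim>" header followed by one vector per line), combining reduces to summing the header counts and concatenating the entries. A sketch under that assumption:

```python
def combine_emb_files(path_a, path_b, out_path):
    """Concatenate two word2vec-format text embedding files into one."""
    with open(path_a, encoding="utf-8") as fa, \
         open(path_b, encoding="utf-8") as fb, \
         open(out_path, "w", encoding="utf-8") as out:
        na, dim_a = fa.readline().split()
        nb, dim_b = fb.readline().split()
        assert dim_a == dim_b, "embedding dimensions must match"
        # New header: total vocabulary size, shared dimension.
        out.write(f"{int(na) + int(nb)} {dim_a}\n")
        for line in fa:
            out.write(line)
        for line in fb:
            out.write(line)
```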

create_lang_file.py
Creates a .lang file with the target language for each data point where we have autoencoding data in one language.

create_double_lang_file.py
Creates a .lang file with the target language for each data point where we have autoencoding data in two languages.

shuffle.py
Shuffle data along with the language descriptors in the .lang file.
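
The key point is that the .src, .lang and .tgt files must be permuted with the same order. A minimal sketch, with hypothetical output naming:

```python
import random
from pathlib import Path

def shuffle_in_unison(src_path, lang_path, tgt_path, seed=0):
    """Shuffle parallel .src/.lang/.tgt files with one shared permutation."""
    paths = [Path(src_path), Path(lang_path), Path(tgt_path)]
    lines = [p.read_text(encoding="utf-8").splitlines() for p in paths]
    assert len({len(ls) for ls in lines}) == 1, "files must be line-parallel"
    order = list(range(len(lines[0])))
    random.Random(seed).shuffle(order)
    for p, ls in zip(paths, lines):
        # Hypothetical output naming; the repo uses train.shuffled.* names.
        shuffled = "\n".join(ls[i] for i in order) + "\n"
        p.with_suffix(".shuffled" + p.suffix).write_text(shuffled, encoding="utf-8")
```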

prepend_lang_tokens.py
Prepends the language-specific start tokens to the data based on the language descriptors in the .lang file.
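
Given the .lang file, prepending is a line-by-line zip. A sketch assuming a hypothetical <2xx> token format; the actual token strings are defined by the repo's preprocessing:

```python
def prepend_lang_tokens(text_path, lang_path, out_path):
    """Prefix each sentence with its target-language start token."""
    with open(text_path, encoding="utf-8") as ftxt, \
         open(lang_path, encoding="utf-8") as flang, \
         open(out_path, "w", encoding="utf-8") as out:
        for sent, lang in zip(ftxt, flang):
            # <2fr> / <2en> is an assumed token format for illustration.
            out.write(f"<2{lang.strip()}> {sent.lstrip()}")
```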

val_plotter
Plots validation losses of a model against the number of training steps.

postprocess_preds.py
Post-process test predictions by replacing French words with English equivalents using a dictionary.
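
A sketch of such token-level substitution, assuming a two-column "french english" dictionary file; the actual dictionary format may differ:

```python
def postprocess_preds(pred_path, dict_path, out_path):
    """Replace French tokens with English equivalents from a dictionary."""
    fr2en = {}
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                # Keep the first translation seen for each French word.
                fr2en.setdefault(parts[0], parts[1])
    with open(pred_path, encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f:
            out.write(" ".join(fr2en.get(t, t) for t in line.split()) + "\n")
```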



EVALUATION CODE (eval-code/)

stanford/
The downloaded unzipped Stanford CoreNLP library.

bleu.sh
BLEU Score calculator.

bert_per_line.py
BERTScore Calculator that outputs the score for each line into a file.
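
The bert-score library already returns per-sentence scores, so the per-line variant is essentially the following (file names are placeholders):

```python
from bert_score import score

with open("preds.txt", encoding="utf-8") as f:
    cands = [line.strip() for line in f]
with open("refs.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# score() returns per-sentence precision, recall and F1 tensors.
P, R, F1 = score(cands, refs, lang="en")
with open("bert_f1_per_line.txt", "w", encoding="utf-8") as out:
    out.writelines(f"{f1:.4f}\n" for f1 in F1.tolist())
```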

diversity_analysis.py
Computes a set of diversity metrics for a sample of the original test set and its predictions by a baseline and a model. The sample is the set of sentences for which both predictions have a BERTScore above a manually set threshold.

bleu1.py
BLEU1 Score calculator.

iou.py
IOU Score calculator.
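
IOU here presumably means the Jaccard overlap between the token sets of a hypothesis and its reference; a sketch under that assumption:

```python
def iou(hyp, ref):
    """Jaccard overlap |A & B| / |A | B| between two token sets."""
    a, b = set(hyp.split()), set(ref.split())
    return len(a & b) / len(a | b) if a | b else 1.0

# iou("the cat sat", "the cat slept")  ->  2/4 = 0.5
```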

wer.py
WER Score calculator.
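
WER is the word-level edit distance normalised by the reference length; a minimal dynamic-programming sketch:

```python
def wer(hyp, ref):
    """Word error rate: word-level edit distance / reference length."""
    h, r = hyp.split(), ref.split()
    # Standard Levenshtein distance over words, one DP row at a time.
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)
```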

iou_per_line.py
IOU Score calculator that outputs the IOU of each line into a file.

wer_per_line.py
WER Score calculator that outputs the WER of each line into a file.

meteor.py
METEOR Score calculator

sari.py
SARI Score calculator.

parse.py
Sentence Parser.

tree_utils.py
Utility Functions.

ted.py
TED Score calculator.

ted_per_line.py
TED Score calculator that outputs the TED of each line into a file.

paraphrase_perc.py
Paraphrasing Percentage calculator.



EMBEDDINGS (embeddings/)

corpus.fasttext.txt, corpus.fasttext.fr
Fasttext embeddings on a big IWSLT corpus, used in the original VMF paper, provided by Sachin.

corpus.fasttext.both
Both the language embeddings concatenated.

original.fasttext.txt, original.fasttext.fr
Pre-trained fasttext embeddings obtained directly from the Fasttext website.

mapped.fasttext.txt, mapped.fasttext.fr
The above embeddings mapped into the same space using a cross-lingual vector mapping tool.

mapped.fasttext.both
Both the language embeddings concatenated.

en-fr.seed, fr-en.seed
Seed dictionaries in both directions, used by the cross-lingual mapping tool.

eng.txt
List of all English words.



DUMP (dump/)

In all the folders, files of the form vmf.vocab.pt, vmf.train*.pt and vmf.valid*.pt correspond to preprocessed data for the model.

vmf
ParaVMF model with IWSLT autoencoding data

vmfpara
ParaVMF model with ParaNMT autoencoding data

ce
ParaCE model with IWSLT autoencoding data

frenvmf
French to English translation model with VMF loss

frence
French to English translation model with CE loss

enfrvmf
English to French translation model with VMF loss

enfrce
English to French translation model with CE loss

vmffr
ParaVMF model with IWSLT French autoencoding data

cefr
ParaCE model with IWSLT French autoencoding data

big
ParaVMF model with the big dataset - 1% (40K) autoencoding data

big_4k_vmf
ParaVMF model with big dataset - 4K autoencoding data

bigpara_4k_vmf
ParaVMF model with big dataset - 4K ParaNMT autoencoding data

superpara_vmf
Supervised paraphrasing model on ParaNMT with VMF loss

superpara_ce
Supervised paraphrasing model on ParaNMT with CE loss



SCRIPTS (scripts/)

data_creation.sh
Create data with two-way translation and auto-encoding in one language.

noisy_data_creation.sh
Create data with two-way translation and noisy auto-encoding in one language.

noisy_double_lang_data_creation.sh
Create data with two-way translation and noisy auto-encoding in two languages.

tokenize.sh
Uses Moses to tokenize files, train a truecaser model, and truecase files with it.

vmf.sh
ParaVMF Model with two sources of autoencoding data: IWSLT and ParaNMT. Translation data is always IWSLT, of course.

vmf_joint.sh
ParaVMF Model with IWSLT autoencoding data in both English and French.

ce.sh
ParaCE Model with IWSLT autoencoding data in both English and French.

translation_vmf.sh
Translation models, both ways - VMF loss.

translation_ce.sh
Translation models, both ways - CE loss.

bilingual_pivoting.sh
Bilingual Pivoting using the already trained translation models - both VMF and CE losses.

french_paraphrasing.sh
ParaVMF, ParaCE, Bilingual Pivoting with VMF, and Bilingual Pivoting with CE models - to paraphrase in French. The ParaVMF and ParaCE models use autoencoding data in French, whereas the bilingual pivoting models use the existing translation models.

backtranslation.sh
Backtranslation using the already trained translation models - both VMF and CE losses.

big_vmf.sh
ParaVMF models using John's dataset with 4,000 autoencoding data points (rather than 1%), with two types of autoencoding data: with and without noising.

big_ce.sh
Models using John's dataset with 4000 autoencoding data points rather than 1% in the ParaCE model with two types of autoencoding data: with and without noising.

big_vmf_french.sh
Models using John's dataset with 4000 denoising data points, but for paraphrasing in French with the vMF loss.

big_ce_french.sh
Models using John's dataset with 4000 denoising data points, but for paraphrasing in French with the CE loss.

ablation.sh
Ablation models for ParaVMF:

  • No language start token at source/encoder
  • No autoencoding data at all

evaluation.sh
Post-processing and Evaluation with different metrics

evaluation_per_line.sh
Post-processing and Evaluation with different metrics for each individual sentence

evaluation_util.sh
Utility script to assign variables helpful for evaluation. Add lines here for each new model.

multiple_validation_sets.sh
Preprocess data when feeding multiple validation sets.

supervision.sh
Supervised training on the ParaNMT dataset - VMF and CE losses


