
Class Conditional Variational Transformer

Code for our paper "Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes" (arXiv preprint arXiv:2205.09391, 2022).

Abstract: Data augmentation methods for Natural Language Processing tasks have been explored in recent years; however, they are limited, and it is hard to capture diversity at the sentence level. Moreover, it is not always possible to perform data augmentation on supervised tasks. To address these problems, we propose a neural data augmentation method, which combines a Conditional Variational Autoencoder with an encoder-decoder Transformer model. While encoding and decoding the input sentence, our model captures the syntactic and semantic representation of the input language together with its class condition. Following the recent developments in pre-trained language models, we train and evaluate our models on several benchmarks to strengthen downstream tasks. We compare our method with 3 different augmentation techniques. The presented results show that our model increases the performance of current models compared to other data augmentation techniques with a small amount of computation power.

Training A Tokenizer

python train_tokenizer.py \
       --dataframe "train.csv" \
       --cased "true" \
       --preprocess "true" \
       --tokenizer "space" \
       --feature_name "sentence"

Please take a look at the arguments in the train_tokenizer.py file if you want to configure the tokenizer.

The vocab and config files of your trained tokenizer will be saved under the ./tokenizer directory.
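
As a quick sanity check, you can inspect the saved files. The snippet below is only a hypothetical sketch: it assumes the vocabulary is stored as a JSON token-to-id mapping named vocab.json, while the actual file names and format are defined by train_tokenizer.py.

# Hypothetical sketch: inspect the tokenizer artifacts saved under ./tokenizer.
# The file name "vocab.json" and the token -> id JSON layout are assumptions;
# check train_tokenizer.py for the actual output format.
import json
from pathlib import Path

vocab_path = Path("./tokenizer") / "vocab.json"
with open(vocab_path, "r", encoding="utf-8") as f:
    vocab = json.load(f)
print(f"vocabulary size: {len(vocab)}")
print("sample entries:", list(vocab.items())[:5])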

Training Class Conditional Variational Transformer

main.py will use the trained tokenizer that is saved under the ./tokenizer directory.

python main.py \
       --df_train "train.csv" \
       --df_test "test.csv" \
       --preprocess "true" \
       --epochs 90 \
       --tokenizer "space" \
       --max_seq_len 128 \
       --df_sentence_name "sentence" \
       --df_target_name "target" \
       --cuda "true" \
       --batch_size 32 \
       --posterior_collapse "true" \
       --initial_learning_rate 0.0005 \
       --noise "false" \
       --n_classes 6 \
       --latent_size 32

Please take a look at the arguments in the main.py file if you want to configure the proposed model's hyperparameters or the training configuration, or to load a model and resume training.

The model_params.json, model.pt, and optimizer.pt (scheduler.pt if used) files will be saved under the main directory.
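
The --posterior_collapse flag enables the repository's mitigation against posterior collapse; the exact schedule lives in the training code. As a rough illustration of the general idea (not the repository's implementation), a cyclical KL-weight annealing schedule for a VAE-style loss can be sketched as follows:

# Illustrative sketch only: cyclical KL-weight annealing, a common way to fight
# posterior collapse in (conditional) VAEs. The actual strategy triggered by
# --posterior_collapse "true" is defined in the repository's training code.
import torch

def kl_weight(step: int, cycle: int = 10000, ratio: float = 0.5) -> float:
    # Ramp the weight from 0 to 1 over the first `ratio` of every cycle, then hold at 1.
    position = (step % cycle) / cycle
    return min(position / ratio, 1.0)

def vae_loss(recon_loss, mu, logvar, step):
    # Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    return recon_loss + kl_weight(step) * kl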

Generating Sentences

Please use the generate.ipynb notebook to generate new sentences for data augmentation.
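
At a high level, class-conditional generation with this kind of model means sampling a latent vector from the prior, pairing it with the desired class label, and decoding token by token. The sketch below is hypothetical: the model and tokenizer interfaces (model.decode, bos/eos token ids) are assumptions, and the actual procedure is the one in generate.ipynb.

# Hypothetical sketch of class-conditional sampling; the real interface is in
# generate.ipynb. `model.decode`, `bos_token_id`, and `eos_token_id` are assumptions.
import torch

@torch.no_grad()
def generate_sentence(model, tokenizer, class_id, latent_size=32, max_len=128, device="cpu"):
    model.eval()
    z = torch.randn(1, latent_size, device=device)    # sample z ~ N(0, I) from the prior
    label = torch.tensor([class_id], device=device)   # class condition
    tokens = [tokenizer.bos_token_id]
    for _ in range(max_len):
        inp = torch.tensor([tokens], device=device)
        logits = model.decode(z, label, inp)          # assumed decoding API
        next_id = int(logits[0, -1].argmax())
        if next_id == tokenizer.eos_token_id:
            break
        tokens.append(next_id)
    return tokenizer.decode(tokens[1:])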

Finetuning

We also provide finetuning scripts at ./benchmarks/models. However, anyone can write their own finetuning code.
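
A typical downstream workflow is to append the generated sentences to the original training set before finetuning a classifier. Below is a minimal sketch with pandas, assuming the generated sentences were exported to a hypothetical generated.csv with the same columns as train.csv (the scripts under ./benchmarks/models may organize this differently).

# Minimal sketch of the augmentation step, not the benchmark scripts themselves.
# "generated.csv" is a hypothetical export of the sentences produced by generate.ipynb.
import pandas as pd

original = pd.read_csv("train.csv")                    # columns: sentence, target
augmented = pd.read_csv("generated.csv")               # same columns, generated data
combined = pd.concat([original, augmented], ignore_index=True)
combined = combined.sample(frac=1.0, random_state=42)  # shuffle before finetuning
combined.to_csv("train_augmented.csv", index=False)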

Pre-training Class Conditional Variational Transformer

We haven't experimented much with our pre-training objective and code. To pre-train the Class Conditional Variational Transformer, we use denoising sequence-to-sequence pre-training, as proposed by Lewis et al., 2019.
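
In that objective, the encoder input is a corrupted version of the sequence (for example, randomly masked tokens) and the decoder reconstructs the original text. A rough illustration of token masking in this spirit is below; the exact corruption used by main_pretraining.py may differ.

# Illustrative token masking in the spirit of denoising seq2seq pre-training
# (Lewis et al., 2019). The exact noise function in main_pretraining.py may differ.
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=None):
    # Replace each token with the mask id with probability `mask_prob`;
    # return (noisy encoder input, clean decoder target).
    rng = random.Random(seed)
    noisy = [mask_id if rng.random() < mask_prob else t for t in token_ids]
    return noisy, list(token_ids)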

Train a tokenizer:

python train_tokenizer.py \
       --dataframe "wiki.train.tokens" \
       --cased "true" \
       --preprocess "false" \
       --tokenizer "bpe"

Pre-train Class Conditional Variational Transformer:

python main_pretraining.py \
       --train_corpus "wiki.train.tokens" \
       --test_corpus "wiki.test.tokens" \
       --batch_size 32 \
       --epochs 150 \
       --tokenizer "bpe" \
       --max_seq_len 256 \
       --latent_size 32 \
       --initial_learning_rate 0.0005 \
       --posterior_collapse "true" \

Finetuning Class Conditional Variational Transformer

After pre-training, the model_params.json, model.pt, and optimizer.pt (scheduler.pt if used) files will be saved under the main directory. Using the same training script (with a few additional arguments), you can finetune your pre-trained Class Conditional Variational Transformer to generate new sentences for data augmentation.

python main.py \
       --df_train "train.csv" \
       --df_test "test.csv" \
       --preprocess "true" \
       --epochs 90 \
       --tokenizer "space" \
       --max_seq_len 128 \
       --df_sentence_name "sentence" \
       --df_target_name "target" \
       --cuda "true" \
       --batch_size 32 \
       --posterior_collapse "true" \
       --initial_learning_rate 0.0005 \
       --noise "false" \
       --n_classes 6 \
       --latent_size 32 \
       --model "model.pt" \
       --model_params "model_params.json" \
       --pretraining "true"

We implemented four finetuning procedures in the finetune.py file. However, we haven't exposed them as an argument in the main.py file. If anybody wants to freeze custom layers, please pass a dictionary to the constructor in the main.py file:

if args.pretraining == "true":
    # Freeze all encoder layers except the last and all decoder layers except the first.
    layers = {"enc": [0, 1, 2], "dec": [1, 2, 3]}
    freezer = Freezer(layers)
    model = freezer.freeze(model)

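For reference, a layer freezer along these lines could look like the sketch below. It is not necessarily the implementation in finetune.py and assumes the model exposes indexable encoder.layers and decoder.layers module lists.

# Sketch of a layer freezer; assumes `model.encoder.layers` and `model.decoder.layers`
# are indexable lists of modules. The actual Freezer is defined in finetune.py.
class Freezer:
    def __init__(self, layers):
        self.layers = layers  # e.g. {"enc": [0, 1, 2], "dec": [1, 2, 3]}
    def freeze(self, model):
        for idx in self.layers.get("enc", []):
            for param in model.encoder.layers[idx].parameters():
                param.requires_grad = False  # exclude from gradient updates
        for idx in self.layers.get("dec", []):
            for param in model.decoder.layers[idx].parameters():
                param.requires_grad = False
        return model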