
Variational Sentence Augmentation For Masked Language Modeling

Code for our paper "Variational Sentence Augmentation For Masked Language Modeling" (Innovations in Intelligent Systems and Applications Conference, ASYU 2021).

Paper Link

From the abstract: We introduce a variational sentence augmentation method that consists of a Variational Autoencoder and a Gated Recurrent Unit. The proposed data augmentation method benefits from its latent space representation, which encodes semantic and syntactic properties of the language. After learning the representation of the language, the model generates sentences from its latent space through the sequential structure of the GRU. By augmenting an existing unstructured corpus, the model improves Masked Language Modeling in pre-training; as a result, it improves fine-tuning as well. In pre-training, our method increases the prediction rate of masked tokens. In fine-tuning, we show that variational sentence augmentation can help both semantic and syntactic tasks. We run our experiments and evaluations on a limited dataset of Turkish sentences, which also stands as a contribution to low-resource languages.
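
Very roughly, the architecture described above is a GRU encoder that compresses a sentence into a latent vector and a GRU decoder that generates tokens from a sample of that vector. The PyTorch sketch below only illustrates this idea; the layer sizes, names, and training details are assumptions for illustration, not the repository's actual implementation (see train_vae.py for that).

import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, latent_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        emb = self.embedding(tokens)                        # (batch, seq, embed)
        _, h = self.encoder(emb)                            # h: (1, batch, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        h0 = self.latent_to_hidden(z).unsqueeze(0)          # initialize decoder from the latent
        dec_out, _ = self.decoder(emb, h0)                  # teacher forcing on the input tokens
        return self.out(dec_out), mu, logvar                # token logits, plus mu/logvar for the KL term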

Train On Your Corpus

Organize your folder structure as:

      data/
       |-- corpus.train.txt
       |-- corpus.valid.txt

Then run:

python3 train_vae.py --data_name "corpus" --print_every 50 --epochs 1

For more detailed arguments, see the source file.

Generate New Sentences

python3 augment.py  --data_name "corpus" \
                    --checkpoint "/models/vae_epoch{epoch}.pt" \
                    --generate_iteration 100 --unk_threshold 0

For more detailed arguments, see the source file.

The augmented sentences are saved to augmentations.txt. Merge this file with the original corpus, for example as shown below.
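
A minimal way to join the two files (the file names follow the examples in this README and the pretraining command below; adjust them if your corpus is named differently):

with open("data/corpus.joined.txt", "w", encoding="utf-8") as joined:
    for path in ("data/corpus.train.txt", "augmentations.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Keep one sentence per line in the joined corpus.
                joined.write(line.rstrip("\n") + "\n")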

Increase The Performance Of Pretraining

python3 pretrain_bert.py --epochs 1 \
                         --tokenizer "./tokenizer" \
                         --data "data/corpus.joined.txt"

For more detailed arguments, see the source file.
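
pretrain_bert.py expects a tokenizer directory (the --tokenizer argument above). If you do not already have one, a WordPiece vocabulary can be trained on the joined corpus with the Hugging Face tokenizers library; this is only an assumption for illustration, and the tokenizer settings used in the paper may differ.

import os
from tokenizers import BertWordPieceTokenizer

os.makedirs("./tokenizer", exist_ok=True)

# Train a cased WordPiece vocabulary on the joined (augmented) corpus.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["data/corpus.joined.txt"], vocab_size=32000)
tokenizer.save_model("./tokenizer")  # writes vocab.txt into ./tokenizer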

Increase The Performance Of Finetuning (Sequence Classification)

Prepare your dataframes (example):

import pandas as pd
import numpy as np

# Read the raw, headerless CSVs and assign column names.
df_train = pd.read_csv('train.csv', names=['sentence', 'target'])
df_test = pd.read_csv('test.csv', names=['sentence', 'target'])

# Cast the labels to a numeric type.
df_train['target'] = df_train['target'].astype(np.float16)
df_test['target'] = df_test['target'].astype(np.float16)

# Write the files back with a header row and without the index column.
df_train.to_csv("train.csv", index=False)
df_test.to_csv("test.csv", index=False)

Then fine-tune the pretrained BERT model:

python3 finetune_bert.py --downstream_task "sequence classification" \
                         --bert_model "./models7" \
                         --dataset "." \
                         --tokenizer "./tokenizer"
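
Once fine-tuning has saved a model directory, the checkpoint can be loaded for inference with the transformers API. The sketch below is an assumption about the output format; "./finetuned_model" is a hypothetical placeholder, since the actual output location depends on finetune_bert.py.

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# "./finetuned_model" is a hypothetical path; point it at the directory
# produced by finetune_bert.py.
tokenizer = BertTokenizerFast.from_pretrained("./tokenizer")
model = BertForSequenceClassification.from_pretrained("./finetuned_model")
model.eval()

inputs = tokenizer("örnek bir cümle", return_tensors="pt")
with torch.no_grad():
    predicted_class = model(**inputs).logits.argmax(dim=-1).item()
print(predicted_class)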

Increase The Performance Of Finetuning (Sequence Labeling)

Prepare your dataframe (example):

import pandas as pd
from datasets import load_dataset

# Turkish split of the WikiANN NER dataset.
dataset = load_dataset("wikiann", "tr")

# Map WikiANN integer labels to their NER tag names.
ner_encoding = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}

# Join tokens and their mapped tags into space-separated strings, one sentence per row.
train_tokens = []
train_tags = []
for sample in dataset["train"]:
  train_tokens.append(' '.join(sample["tokens"]))
  train_tags.append(' '.join([ner_encoding[a] for a in sample["ner_tags"]]))

test_tokens = []
test_tags = []
for sample in dataset["test"]:
  test_tokens.append(' '.join(sample["tokens"]))
  test_tags.append(' '.join([ner_encoding[a] for a in sample["ner_tags"]]))

df_train = pd.DataFrame({"sentence": train_tokens, "tags": train_tags})
df_test = pd.DataFrame({"sentence": test_tokens, "tags": test_tags})

df_train.to_csv("train.csv", index=False)
df_test.to_csv("test.csv", index=False)

Then fine-tune the pretrained BERT model:

python3 finetune_bert.py --downstream_task "sequence labeling" \
                         --bert_model "./models7" \
                         --dataset "." \
                         --tokenizer "./tokenizer"

Authors

  • M. Şafak Bilici
  • Mehmet Fatih Amasyali
