
Adversarial embeddings for BERT

Adversarial embedding generation and analysis on top of BERT for sentiment classification on the IMDB Large Movie Review Dataset. Built on the Chainer reimplementation of Google Research's original TensorFlow implementation. IMDB loader and processor functions are taken from this branch. A minimal sketch of the perturbation step follows the figure below.

(figure: embed_examples)
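For intuition, here is a minimal, self-contained sketch of a gradient-based perturbation step in Chainer. The toy mean-pool-plus-linear classifier stands in for the fine-tuned BERT model, and the L2-normalized ascent step with its eps value are illustrative assumptions, not the repository's exact method.

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class ToyClassifier(chainer.Chain):
    """Stand-in for the fine-tuned BERT classifier: mean-pool + linear."""
    def __init__(self, emb_dim=8, n_classes=2):
        super().__init__()
        with self.init_scope():
            self.fc = L.Linear(emb_dim, n_classes)

    def forward(self, embeds):
        # embeds: (batch, seq_len, emb_dim); pool over the token axis
        return self.fc(F.mean(embeds, axis=1))

model = ToyClassifier()
# Keep the token embeddings as a leaf Variable so backward() stores their gradient
embeds = chainer.Variable(np.random.randn(1, 5, 8).astype(np.float32))
label = np.array([1], dtype=np.int32)

loss = F.softmax_cross_entropy(model(embeds), label)
model.cleargrads()
loss.backward()

# Ascend the loss: move the embeddings along the normalized gradient
eps = 1.0  # illustrative magnitude
g = embeds.grad
adv_embeds = embeds.array + eps * g / (np.linalg.norm(g) + 1e-12)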

The repository also includes an algorithm for projecting the adversarial embeddings back to discrete tokens, producing adversarial text candidates. Although the algorithm employs simple heuristics to keep the changes small and admissible, the meaning of the sentence may still change, as the adversary usually targets sentiment-carrying tokens. A sketch of such a projection follows the figure below.

(figure: proj_examples)
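A minimal sketch of such a projection, assuming nothing about the repository's heuristics beyond nearest-neighbour search in embedding space; project_to_tokens and its signature are hypothetical names used for illustration.

import numpy as np

def project_to_tokens(adv_embeds, emb_matrix):
    """Return, for each position, the id of the nearest vocabulary embedding.

    adv_embeds: (seq_len, dim) perturbed token embeddings
    emb_matrix: (vocab_size, dim) the model's token embedding table
    """
    # Squared Euclidean distance from every position to every vocab entry
    dists = ((adv_embeds[:, None, :] - emb_matrix[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (seq_len,) token ids

# Toy usage with random data in place of BERT's actual embedding table
emb_matrix = np.random.randn(100, 8).astype(np.float32)
adv_embeds = np.random.randn(5, 8).astype(np.float32)
token_ids = project_to_tokens(adv_embeds, emb_matrix)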

Related Work

Requirements

  • Python (3.6.4)
  • Chainer (6.0.0)
  • CuPy (6.1.0)

Installation

Install packages if they are not already present.

pip install cupy-cuda90 --no-cache-dir --user
pip install chainer --user

Clone and enter the repository.

# cd /cluster/scratch/nethzid
git clone https://github.com/dcetin/bert-chainer.git
cd bert-chainer
# module load python_cpu/3.6.4 cuda/9.0.176

Download the pretrained TensorFlow BERT checkpoint and convert it to Chainer format.

wget 'https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip'
unzip uncased_L-12_H-768_A-12.zip
export BERT_BASE_DIR=./uncased_L-12_H-768_A-12
python convert_tf_checkpoint_to_chainer.py \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --npz_dump_path $BERT_BASE_DIR/arrays_bert_model.ckpt.npz
rm uncased_L-12_H-768_A-12.zip

Download and extract the IMDB dataset.

wget 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
tar -xzf aclImdb_v1.tar.gz
python create_imdb_dataset.py
rm aclImdb_v1.tar.gz

Download the fine-tuned model checkpoint, if it has not been downloaded before.

wget 'https://n.ethz.ch/~dcetin/download/model_snapshot_iter_2343_max_seq_length_128.npz' -P base_models

Usage

Example command (also available in train_imdb.sh) for running the experiments block of the code. Change the last four options according to the desired usage.

# module load python_gpu/3.6.4 cuda/9.0.176
# bsub -n 4 -W 4:00 -R "rusage[mem=1024, ngpus_excl_p=1]" \
python run_classifier.py \
  --task_name=IMDB \
  --data_dir=aclImdb \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_BASE_DIR/arrays_bert_model.ckpt.npz \
  --max_seq_length=128 \
  --train_batch_size=16 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --output_dir=./out_imdb \
  --do_train=false \
  --do_eval=false \
  --do_resume=true \
  --do_experiment=true

Notes

  • The basic directory contains code for adversarial embedding generation, evaluation, and visualization on a much smaller network model (i.e. a multi-step LSTM followed by a linear layer as the encoder, instead of BERT). It also implements adversarial training with an unrolled training loop. See also its own readme file.
  • The features the classifier utilizes (i.e. the pooled encodings, the output of the penultimate layer) for all evaluation runs (standard and all four adversarial cases) can be found online; they are what the save_outputs function creates and writes. See the loading sketch after this list.
wget 'https://n.ethz.ch/~dcetin/download/train_test_outputs.pickle'
  • The output that the summary_statistics function dumps can also be found online (see the same sketch below); one can simply call summary_histogram or any other function on the sampled data.
wget 'https://n.ethz.ch/~dcetin/download/summary_data_10000_5_5.pickle'
  • Training/evaluation on GLUE tasks (e.g. MRPC) can be done as shown below, after downloading the TensorFlow BERT checkpoints. Be aware that some experimental functions are written explicitly for the IMDB dataset and may not work, or may work in unintended ways, on other tasks.
# module load python_cpu/3.6.4 cuda/9.0.176
wget "https://n.ethz.ch/~dcetin/download/download_glue_data.py"
python download_glue_data.py
export GLUE_DIR=./glue_data
# module load python_gpu/3.6.4 cuda/9.0.176
# bsub -n 6 -W 4:00 -R "rusage[mem=1024, ngpus_excl_p=1]" -R "select[gpu_model0==TeslaV100_SXM2_32GB]" \
python run_classifier.py \
  --task_name MRPC \
  --data_dir $GLUE_DIR/MRPC/ \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_BASE_DIR/arrays_bert_model.ckpt.npz \
  --max_seq_length 128 \
  --train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ./mrpc_output \
  --do_train True \
  --do_eval True \
  --do_lower_case True
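As referenced in the notes above, a minimal sketch for inspecting the two downloaded pickle files. The structure of the stored objects is not documented here, so the snippet only loads them and prints top-level information.

import pickle

with open('train_test_outputs.pickle', 'rb') as f:
    outputs = pickle.load(f)
with open('summary_data_10000_5_5.pickle', 'rb') as f:
    summary = pickle.load(f)

# Top-level inspection only; adapt once the actual structure is known
for name, obj in (('outputs', outputs), ('summary', summary)):
    print(name, type(obj))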