
HemaDevaSagar35/GenderBiasAnalysis

Understanding the gender/social bias of BERT knowledge-distillation (KD) models, specifically TinyBERT.

ACKNOWLEDGEMENT

Note: For most of the experimentation we adapted various official GitHub repos; you can find them in the Reference section. That said, the scripts below are the ones we either wrote from scratch or adapted from an existing script and extended.

GenderBiasAnalysis/TinyBERT/FT_Bert_Classification.py
GenderBiasAnalysis/TinyBERT/task_distill.py
GenderBiasAnalysis/TinyBERT/bias_analysis.py
GenderBiasAnalysis/TinyBERT/result.ipynb
GenderBiasAnalysis/imdbtests/res_data/IMDB_data_preparation_script.py
GenderBiasAnalysis/imdbtests/rate.py
GenderBiasAnalysis/imdbtests/res_plots/biases.ipynb
GenderBiasAnalysis/imdbtests/res_plots/tables.ipynb
GenderBiasAnalysis/TinyBERT/seat_analysis.ipynb
GenderBiasAnalysis/TinyBERT/seat_analysis.py
GenderBiasAnalysis/TinyBERT/seat_bert_encoder.ipynb

INTRODUCTION

Note: We used bert-base-uncased as the teacher model in our experiments. You can find the GitHub repo of this project at [https://github.com/HemaDevaSagar35/GenderBiasAnalysis]. We also share a Google Drive link below, where we uploaded the models and data that are too large to put on Canvas or GitHub.

There are 2 facets here:

  1. Training BERT and TinyBERT models on the MLMA and IMDB datasets
  2. Running various bias analyses with the models obtained from Step 1.

You can skip Step 1 and run the Step 2 scripts using the models we generated in our experiments. You can find these models and the relevant data here [https://drive.google.com/drive/folders/1XmLXSMbYAur1mZfqGJfmUaTQGa8BuX1S?usp=share_link]

But in case you want to run Step 1 and re-generate the models, here is what you need to do:

HOW TO RUN STEP 1

Prerequisites

  1. Go to GenderBiasAnalysis/TinyBERT/ and run
pip install -r requirements.txt
  2. Download the GloVe embeddings from [https://nlp.stanford.edu/data/glove.6B.zip]
  3. Unzip them into the GenderBiasAnalysis/TinyBERT/embeddings folder
  4. Download the bert-base-uncased folder from [https://drive.google.com/drive/folders/1XmLXSMbYAur1mZfqGJfmUaTQGa8BuX1S?usp=share_link] and place it in GenderBiasAnalysis/TinyBERT/
  5. Download the tinybert-gkd-model folder from [https://drive.google.com/drive/folders/1XmLXSMbYAur1mZfqGJfmUaTQGa8BuX1S?usp=share_link] and place it in GenderBiasAnalysis/TinyBERT/
  6. Download the glue_data folder from [https://drive.google.com/drive/folders/1XmLXSMbYAur1mZfqGJfmUaTQGa8BuX1S?usp=share_link] and place it in GenderBiasAnalysis/data/ (a command sketch of these steps follows this list)
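
For reference, below is a minimal command sketch of these prerequisites. It assumes a Unix-like shell with wget and unzip available, and that the three Google Drive folders have already been downloaded manually (for example to ~/Downloads); adjust the paths to your setup.

cd GenderBiasAnalysis/TinyBERT/
pip install -r requirements.txt

# GloVe embeddings (prerequisites 2 and 3)
wget https://nlp.stanford.edu/data/glove.6B.zip
mkdir -p embeddings
unzip glove.6B.zip -d embeddings/

# Google Drive folders (prerequisites 4-6), assuming they were downloaded to ~/Downloads
mv ~/Downloads/bert-base-uncased .
mv ~/Downloads/tinybert-gkd-model .
mv ~/Downloads/glue_data ../data/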

Training Hate Speech model on MLMA

  1. Change the working directory to GenderBiasAnalysis/TinyBERT/
  2. Fine-tune BERT on the MLMA dataset with the following command
python FT_Bert_Classification.py --data_dir ../data/glue_data/MLMA \
                                       --pre_trained_bert bert-base-uncased \
                                       --task_name MLMA \
                                       --do_lower_case \
                                       --output_dir output_models \
                                       --num_train_epochs 30
  3. Do intermediate distillation of TinyBERT on MLMA using the following command
python task_distill.py --teacher_model output_models \
                       --student_model tinybert-gkd-model \
                       --data_dir ../data/glue_data/MLMA \
                       --task_name MLMA \
                       --output_dir tiny_temp_model \
                       --max_seq_length 128 \
                       --train_batch_size 32 \
                       --num_train_epochs 20 \
                       --aug_train \
                       --do_lower_case
  4. Do prediction layer distillation of TinyBERT on MLMA using the following command
python task_distill.py --pred_distill  \
                       --teacher_model output_models \
                       --student_model tiny_temp_model \
                       --data_dir ../data/glue_data/MLMA \
                       --task_name MLMA \
                       --output_dir tinybert_model \
                       --aug_train \
                       --do_lower_case \
                       --learning_rate 3e-5  \
                       --num_train_epochs  3  \
                       --max_seq_length 64 \
                       --train_batch_size 32
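
After these three commands finish, GenderBiasAnalysis/TinyBERT/ should contain output_models (the fine-tuned BERT teacher), tiny_temp_model (the intermediate-distilled student), and tinybert_model (the final TinyBERT student), assuming the default --output_dir values shown above were kept.

# optional sanity check: each stage above should have produced a model folder
ls output_models tiny_temp_model tinybert_model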

Training Sentiment model on IMDB

  1. Change the working directory to GenderBiasAnalysis/TinyBERT/
  2. Fine-tune BERT on the IMDB dataset using the following command
python FT_Bert_Classification.py --data_dir ../data/glue_data/IMDB \
                                     --pre_trained_bert bert-base-uncased \
                                     --task_name IMDB \
                                     --do_lower_case \
                                     --output_dir imdb_output_models \
                                     --num_train_epochs 30
  3. Do intermediate distillation of TinyBERT on IMDB using the following command
python task_distill.py --teacher_model imdb_output_models \
                       --student_model tinybert-gkd-model \
                       --data_dir ../data/glue_data/IMDB \
                       --task_name IMDB \
                       --output_dir tiny_temp_imdb_model \
                       --max_seq_length 128 \
                       --train_batch_size 32 \
                       --num_train_epochs 20 \
                       --do_lower_case
  4. Do prediction layer distillation of TinyBERT on IMDB using the following command
python task_distill.py --pred_distill  \
                       --teacher_model imdb_output_models \
                       --student_model tiny_temp_imdb_model \
                       --data_dir ../data/glue_data/IMDB \
                       --task_name IMDB \
                       --output_dir tinybert_imdb_model \
                       --do_lower_case \
                       --learning_rate 3e-5  \
                       --num_train_epochs  3  \
                       --max_seq_length 64 \
                       --train_batch_size 32

HOW TO RUN STEP 2

As mentioned at the start, you can run Step 2 either with our models directly or by first running Step 1 and re-generating the models. For convenience, the instructions below are organized by the analyses we performed.

Unintended Bias

If you want to run this analysis directly with our models, complete the following prerequisites first:

  1. Download the folders output_models, tinybert_model, imdb_output_models, and tinybert_imdb_model from [https://drive.google.com/drive/folders/1XmLXSMbYAur1mZfqGJfmUaTQGa8BuX1S?usp=share_link] and place them in GenderBiasAnalysis/TinyBERT/
  2. Download the glue_data folder from [https://drive.google.com/drive/folders/1XmLXSMbYAur1mZfqGJfmUaTQGa8BuX1S?usp=share_link] and place it in GenderBiasAnalysis/data/ (these downloads are only needed if you skipped Step 1 and are using our models directly)

The following is what you need to run to get all the results for this analysis in Step 2.

Run the Jupyter notebook GenderBiasAnalysis/TinyBERT/analysis.ipynb. The notebook is self-explanatory and performs all the analysis presented in the report and the presentation slides.

Gender Bias

First, place the models tinybert_imdb_model and imdb_output_models in the GenderBiasAnalysis/imdbtests/res_models/models folder (download them from [https://drive.google.com/drive/folders/1XmLXSMbYAur1mZfqGJfmUaTQGa8BuX1S]) and rename them 'IMDB_tinybert_original' and 'imdb_bertbase_original', respectively, as shown below.
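
For example (a sketch only, assuming the two downloaded folders sit in the repository root GenderBiasAnalysis/):

mkdir -p imdbtests/res_models/models
cp -r tinybert_imdb_model imdbtests/res_models/models/IMDB_tinybert_original
cp -r imdb_output_models imdbtests/res_models/models/imdb_bertbase_original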

Change the working directory to GenderBiasAnalysis/ and then run the following command

python imdbtests/res_data/IMDB_data_preparation_script.py | tee imdbtests/data_prep.txt

Then run the following command to compute the bias results:

python -c 'import imdbtests.rate; imdbtests.rate.rate()'

After this, you will find the results in the GenderBiasAnalysis/imdbtests/res_results folder.

Use GenderBiasAnalysis/imdbtests/res_plots/biases.ipynb and GenderBiasAnalysis/imdbtests/res_plots/tables.ipynb to consolidate the bias results for the model of your choice and present them as tables and plots.

SEAT Scoring

Complete the two prerequisite steps from the Unintended Bias section, then run the Jupyter notebook GenderBiasAnalysis/TinyBERT/seat_analysis.ipynb. The notebook is self-explanatory. The SEAT tests, results, and plots can be found in the SEAT folder.

Log Probability Bias Score

Follow steps 1 and 2 from the Unintended Bias section, then change the working directory to GenderBiasAnalysis/. To run the log probability bias tests, use the following command:

python TinyBERT/log_probability_bias_analysis.py \
    --eval Log_Probability_Bias/Corpus_Creation/BEC-Pro/BEC-Pro_EN.tsv \
    --model [MODEL PATH] \
    --out Log_Probability_Bias/results/[SAMPLE NAME].csv

[MODEL PATH] for the four models would be TinyBERT/output_models, TinyBERT/imdb_output_models, TinyBERT/tinybert_model, and TinyBERT/tinybert_imdb_model. To run all four models, replace [MODEL PATH] with each model path in turn. Also replace [SAMPLE NAME].csv with the name under which you want to store the results file.

Sample command:

python TinyBERT/log_probability_bias_analysis.py \
    --eval Log_Probability_Bias/Corpus_Creation/BEC-Pro/BEC-Pro_EN.tsv \
    --model TinyBERT/tinybert_imdb_model \
    --out Log_Probability_Bias/results/tinybert_result.csv

Results and other resources related to the log probability tests can be found in the Log_Probability_Bias folder.

Categorical Bias Score

First, change the working directory to GenderBiasAnalysis/categorical_bias_score/ and then run

pip install -r requirements.txt

Evaluation script: point the script at the pretrained model using the command below, and rename the model's bert_config file to config.json, since the code loads the model with the open-source transformers library (see the example after the command).

python categorical_score.py --lang en --custom_model_path [MODEL PATH]
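
For example, to evaluate the distilled IMDB model from Step 1 (a sketch only; the copy and paths are illustrative, and it assumes the model folder contains a bert_config.json as produced in Step 1):

cp -r ../TinyBERT/tinybert_imdb_model .
mv tinybert_imdb_model/bert_config.json tinybert_imdb_model/config.json
python categorical_score.py --lang en --custom_model_path tinybert_imdb_model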

REFERENCE

https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT
https://github.com/sciphie/bias-bert
https://github.com/W4ngatang/sent-bias
https://github.com/marionbartl/gender-bias-BERT
https://github.com/jaimeenahn/ethnic_bias