
Stacked Denoising BERT for Noisy Text Classification (Neural Networks 2020)


Repository for the paper "Stacked DeBERT: All Attention in Incomplete Data for Text Classification".

Contents

  • Overview
  • Requirements
  • How to Use
  • How to Cite

Overview

Stacked DeBERT stacks a denoising autoencoder on top of vanilla BERT to reconstruct the embeddings of incomplete or noisy text, such as sentences corrupted by speech-to-text (STT) errors or informal misspellings, before classification. The model is evaluated on sentiment classification (Twitter Sentiment Corpus) and intent classification (Chatbot NLU Evaluation Corpus).

Requirements

Python 3.6+ (3.7.3 tested), PyTorch 1.0.1.post2, CUDA 9.0 or 10.1

Install with pip:

pip install --default-timeout=1000 torch==1.0.1.post2
pip install -r requirements.txt

Alternatively, install PyTorch with conda:

conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
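To sanity-check the environment, a minimal snippet (the printed values depend on your local setup):

import torch

# Quick environment check; exact output depends on your machine.
print(torch.__version__)          # expect 1.0.1.post2
print(torch.cuda.is_available())  # True if CUDA 9.0/10.1 is configured
print(torch.cuda.device_count())  # number of visible GPUs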

How to Use

1. Dataset

  • Chatbot NLU Evaluation Benchmark dataset with missing/incorrect data (STT errors) and Twitter Sentiment dataset (see the Dataset README; a quick way to inspect a split is sketched after this list)
  • Training is done on:
    • Twitter dataset: complete data, incomplete data, complete+incomplete data
    • Chatbot dataset: complete data, 2 TTS-STT data pairs (gtts-witai, macsay-witai)
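For reference, a minimal sketch for peeking at a data split. The file path and the tab-separated layout are assumptions here; check the Dataset README for the actual format:

import csv

# Hypothetical path and TSV layout; see the Dataset README for the real format.
data_file = "./data/twitter_sentiment_data/sentiment140/train.tsv"

with open(data_file, encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f, delimiter="\t")):
        print(row)  # presumably one (text, label) record per row
        if i == 4:  # show only the first five rows
            break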

2. Pre-fine-tune BERT

  • Twitter Sentiment Corpus
CUDA_VISIBLE_DEVICES=0,1 ./scripts/twitter_sentiment/run_bert_classifier_inc_with_corr.sh

This script is for the Inc+Corr dataset; scripts for the Inc-only and Corr-only datasets are available in the same folder.

  • Chatbot Incomplete Intent Corpus: texts with STT errors
CUDA_VISIBLE_DEVICES=0,1 ./scripts/stterror_intent/run_bert_classifier_stterror.sh

This script is for the noisy data (stterror); a script for the clean, non-noisy data (complete) is also available.
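Under the hood, this step fine-tunes a standard BERT sequence classifier on the text. A minimal sketch of the idea using the pytorch-pretrained-bert API that PyTorch 1.0-era BERT code builds on; the label count, hyperparameters, and the single training example are illustrative, not the repository's actual settings:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification, BertAdam

# Illustrative fine-tuning sketch; num_labels and lr are placeholders.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = BertAdam(model.parameters(), lr=2e-5)

text, label = "i hope it does not rain", 1  # hypothetical training example
tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
labels = torch.tensor([label])

model.train()
loss = model(input_ids, labels=labels)  # returns the classification loss
loss.backward()
optimizer.step()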

3. Train/test model

  • Training on the Twitter Corpus
CUDA_VISIBLE_DEVICES=0,1 ./scripts/twitter_sentiment/run_stacked_debert_dae_classifier_twitter_inc_with_corr.sh

Make sure the OUTPUT directory is the same as the fine-tuned BERT's, or copy the BERT model to your new output directory.
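For instance, a minimal way to copy the fine-tuned BERT model into a new output directory (both paths are placeholders, not the repository's actual layout):

import shutil

# Copy the pre-fine-tuned BERT model into the new output directory.
# The destination must not exist yet; both paths are hypothetical.
shutil.copytree("./results/bert_finetuned/", "./results/stacked_debert_output/")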

  • Training on the NLU Evaluation Corpora for TTS=gtts/macsay, STT=witai, and autoencoder epochs 100-1000 (a sketch of the denoising-autoencoder idea follows below).
CUDA_VISIBLE_DEVICES=0,1 ./scripts/stterror_intent/run_stacked_debert_dae_classifier_stterror.sh
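The "dae" in the script names is a denoising autoencoder that learns to reconstruct the embedding of the complete sentence from the embedding of its noisy counterpart. A minimal sketch of that idea over 768-dimensional BERT hidden states; the bottleneck size, activation, and MSE loss are assumptions for illustration, not necessarily the paper's exact architecture:

import torch
import torch.nn as nn

# Minimal denoising-autoencoder sketch over BERT hidden states.
# The 768 -> 128 bottleneck and MSE loss are illustrative assumptions.
class DenoisingAutoencoder(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden_size, bottleneck), nn.Tanh())
        self.decoder = nn.Linear(bottleneck, hidden_size)

    def forward(self, noisy_hidden):
        return self.decoder(self.encoder(noisy_hidden))

dae = DenoisingAutoencoder()
criterion = nn.MSELoss()

# Placeholder tensors standing in for BERT embeddings of the noisy
# (incomplete) and clean (complete) versions of the same sentences.
noisy = torch.randn(4, 128, 768)  # (batch, seq_len, hidden)
clean = torch.randn(4, 128, 768)
loss = criterion(dae(noisy), clean)
loss.backward()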

4. Test model

  • Testing on the Twitter Sentiment Corpus (the command's --task_name and --data_dir select the corpus)
CUDA_VISIBLE_DEVICES=0 python run_stacked_debert_dae_classifier.py \
    --seed 1 \
    --task_name "sentiment140_sentiment" \
    --save_best_model \
    --do_eval \
    --do_lower_case \
    --data_dir ./data/twitter_sentiment_data/sentiment140/ \
    --bert_model bert-base-uncased \
    --max_seq_length 128 \
    --train_batch_size 4 \
    --eval_batch_size 1 \
    --learning_rate 2e-5 \
    --num_train_epochs_autoencoder 3 \
    --num_train_epochs 3 \
    --output_dir_first_layer "./results/test/results_stacked_debert_dae_earlyStopWithEvalLoss_twitter_10seeds/inc_with_corr_sentences_TestOnlyIncorrect/sentiment140_ep3_bs4_inc_with_corr_TestOnlyIncorrect_seed1_first_layer_epae1000/" \
    --output_dir "./results/test/results_stacked_debert_dae_earlyStopWithEvalLoss_twitter_10seeds/inc_with_corr_sentences_TestOnlyIncorrect/sentiment140_ep3_bs4_inc_with_corr_TestOnlyIncorrect_seed1_second_layer_epae1000/"

How to Cite

If you use this code, please use the following citation:

@article{CUNHASERGIO202187,
    title = "Stacked DeBERT: All attention in incomplete data for text classification",
    author = "Gwenaelle {Cunha Sergio} and Minho Lee",
    journal = "Neural Networks",
    volume = "136",
    pages = "87--96",
    year = "2021",
    issn = "0893-6080",
    doi = "10.1016/j.neunet.2020.12.018",
    url = "http://www.sciencedirect.com/science/article/pii/S0893608020304433"
}

For further requests or questions, email gwena.cs@gmail.com.

Acknowledgment

The authors would like to thank Snips.co and Kaggle for their public datasets (the Snips NLU Benchmark and the Sentiment140 Twitter dataset), and HuggingFace for their BERT PyTorch code.
