
Stacked Denoising BERT for Noisy Text Classification (Neural Networks 2020)


Repository for the paper "Stacked DeBERT: All Attention in Incomplete Data for Text Classification".

Contents

  • Overview
  • Requirements
  • How to Use
  • How to Cite

Overview

Stacked DeBERT stacks a denoising autoencoder on top of vanilla BERT to reconstruct the embeddings of incomplete or noisy text, such as sentences corrupted by speech-to-text (STT) errors or informal misspellings, before classification. The model is evaluated on sentiment classification (Twitter Sentiment Corpus) and intent classification (Chatbot NLU Evaluation Corpus).

Requirements

Python 3.6+ (3.7.3 tested), PyTorch 1.0.1.post2, CUDA 9.0 or 10.1

Install with pip:

pip install --default-timeout=1000 torch==1.0.1.post2
pip install -r requirements.txt

Alternatively, install PyTorch with conda:

conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
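To sanity-check the environment, a minimal snippet (the printed values depend on your local setup):

import torch

# Quick environment check; exact output depends on your machine.
print(torch.__version__)          # expect 1.0.1.post2
print(torch.cuda.is_available())  # True if CUDA 9.0/10.1 is configured
print(torch.cuda.device_count())  # number of visible GPUs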

How to Use

1. Dataset

  • Chatbot NLU Evaluation Benchmark dataset with missing/incorrect data (STT errors) and Twitter Sentiment dataset (see the Dataset README; a quick way to inspect a split is sketched after this list)
  • Training is done on:
    • Twitter dataset: complete data, incomplete data, complete+incomplete data
    • Chatbot dataset: complete data, 2 TTS-STT data pairs (gtts-witai, macsay-witai)
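For reference, a minimal sketch for peeking at a data split. The file path and the tab-separated layout are assumptions here; check the Dataset README for the actual format:

import csv

# Hypothetical path and TSV layout; see the Dataset README for the real format.
data_file = "./data/twitter_sentiment_data/sentiment140/train.tsv"

with open(data_file, encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f, delimiter="\t")):
        print(row)  # presumably one (text, label) record per row
        if i == 4:  # show only the first five rows
            break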

2. Pre-fine-tune BERT

  • Twitter Sentiment Corpus
CUDA_VISIBLE_DEVICES=0,1 ./scripts/twitter_sentiment/run_bert_classifier_inc_with_corr.sh

This script is for the Inc+Corr dataset; scripts for the Inc-only and Corr-only datasets are available in the same folder.

  • Chatbot Incomplete Intent Corpus: texts with STT errors
CUDA_VISIBLE_DEVICES=0,1 ./scripts/stterror_intent/run_bert_classifier_stterror.sh

This script is for the noisy data (stterror); a script for the clean, non-noisy data (complete) is also available.
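Under the hood, this step fine-tunes a standard BERT sequence classifier on the text. A minimal sketch of the idea using the pytorch-pretrained-bert API that PyTorch 1.0-era BERT code builds on; the label count, hyperparameters, and the single training example are illustrative, not the repository's actual settings:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification, BertAdam

# Illustrative fine-tuning sketch; num_labels and lr are placeholders.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = BertAdam(model.parameters(), lr=2e-5)

text, label = "i hope it does not rain", 1  # hypothetical training example
tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
labels = torch.tensor([label])

model.train()
loss = model(input_ids, labels=labels)  # returns the classification loss
loss.backward()
optimizer.step()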

3. Train/test model

  • Training on the Twitter Corpus
CUDA_VISIBLE_DEVICES=0,1 ./scripts/twitter_sentiment/run_stacked_debert_dae_classifier_twitter_inc_with_corr.sh

Make sure the OUTPUT directory is the same as the fine-tuned BERT's, or copy the BERT model to your new output directory.
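For instance, a minimal way to copy the fine-tuned BERT model into a new output directory (both paths are placeholders, not the repository's actual layout):

import shutil

# Copy the pre-fine-tuned BERT model into the new output directory.
# The destination must not exist yet; both paths are hypothetical.
shutil.copytree("./results/bert_finetuned/", "./results/stacked_debert_output/")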

  • Training on the NLU Evaluation Corpora for TTS=gtts/macsay, STT=witai, and autoencoder epochs 100-1000 (a sketch of the denoising-autoencoder idea follows below).
CUDA_VISIBLE_DEVICES=0,1 ./scripts/stterror_intent/run_stacked_debert_dae_classifier_stterror.sh
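The "dae" in the script names is a denoising autoencoder that learns to reconstruct the embedding of the complete sentence from the embedding of its noisy counterpart. A minimal sketch of that idea over 768-dimensional BERT hidden states; the bottleneck size, activation, and MSE loss are assumptions for illustration, not necessarily the paper's exact architecture:

import torch
import torch.nn as nn

# Minimal denoising-autoencoder sketch over BERT hidden states.
# The 768 -> 128 bottleneck and MSE loss are illustrative assumptions.
class DenoisingAutoencoder(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden_size, bottleneck), nn.Tanh())
        self.decoder = nn.Linear(bottleneck, hidden_size)

    def forward(self, noisy_hidden):
        return self.decoder(self.encoder(noisy_hidden))

dae = DenoisingAutoencoder()
criterion = nn.MSELoss()

# Placeholder tensors standing in for BERT embeddings of the noisy
# (incomplete) and clean (complete) versions of the same sentences.
noisy = torch.randn(4, 128, 768)  # (batch, seq_len, hidden)
clean = torch.randn(4, 128, 768)
loss = criterion(dae(noisy), clean)
loss.backward()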

4. Test model

  • Testing on the Twitter Sentiment Corpus (the command's --task_name and --data_dir select the corpus)
CUDA_VISIBLE_DEVICES=0 python run_stacked_debert_dae_classifier.py \
    --seed 1 \
    --task_name "sentiment140_sentiment" \
    --save_best_model \
    --do_eval \
    --do_lower_case \
    --data_dir ./data/twitter_sentiment_data/sentiment140/ \
    --bert_model bert-base-uncased \
    --max_seq_length 128 \
    --train_batch_size 4 \
    --eval_batch_size 1 \
    --learning_rate 2e-5 \
    --num_train_epochs_autoencoder 3 \
    --num_train_epochs 3 \
    --output_dir_first_layer "./results/test/results_stacked_debert_dae_earlyStopWithEvalLoss_twitter_10seeds/inc_with_corr_sentences_TestOnlyIncorrect/sentiment140_ep3_bs4_inc_with_corr_TestOnlyIncorrect_seed1_first_layer_epae1000/" \
    --output_dir "./results/test/results_stacked_debert_dae_earlyStopWithEvalLoss_twitter_10seeds/inc_with_corr_sentences_TestOnlyIncorrect/sentiment140_ep3_bs4_inc_with_corr_TestOnlyIncorrect_seed1_second_layer_epae1000/"

How to Cite

If you use this code, please use the following citation:

@article{CUNHASERGIO202187,
    title = "Stacked DeBERT: All attention in incomplete data for text classification",
    author = "Gwenaelle {Cunha Sergio} and Minho Lee",
    journal = "Neural Networks",
    volume = "136",
    pages = "87--96",
    year = "2021",
    issn = "0893-6080",
    doi = "10.1016/j.neunet.2020.12.018",
    url = "http://www.sciencedirect.com/science/article/pii/S0893608020304433"
}

For further requests or questions, email gwena.cs@gmail.com.

Acknowledgment

The authors would like to thank Snips.co and Kaggle for their public datasets (the Snips NLU Benchmark and the Sentiment140 Twitter dataset), and HuggingFace for their BERT PyTorch code.
