
Multi-Label Text Classification by fine-tuning BERT and XLNet and deployment using Flask



Email Classification

Classification using DistilBERT and XLNet

[Screenshot: Home Page]

The email dataset (.msg files) that I originally used in this project is confidential and thus can't be open-sourced, so in the Jupyter notebook I've used a text-complaint dataset, stored in a CSV file, to train the model. Here is the link for that: Click.

If you do have .msg files, you can train the model on them by following the instructions provided in the notebook.
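As a rough illustration of the first step in the notebook, here is a minimal sketch of loading a labelled CSV with pandas; the file name and column names (complaints.csv, text, label) are hypothetical placeholders, not necessarily the names used by the actual dataset:

import pandas as pd

# Hypothetical file and column names; adjust them to match the actual CSV.
df = pd.read_csv("complaints.csv")
df = df.dropna(subset=["text", "label"])           # drop rows missing text or label
labels = sorted(df["label"].unique())              # class names, later reused as folder names
label2id = {name: i for i, name in enumerate(labels)}
df["label_id"] = df["label"].map(label2id)
print(df[["text", "label_id"]].head())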

Having trained a model, let's deploy it using Flask. The deployment below provides a UI to browse and read .msg files, classifies them, and places each file into its predicted folder. Note: it currently uses DistilBERT in its backend; however, you can make the relevant alterations to tasks.py to run XLNet instead.
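As a sketch of what the backend has to do per file, the snippet below reads a .msg file and moves it into a folder named after the predicted class. It assumes the extract-msg package for parsing (the repository may use a different parser), and predict is a hypothetical callable standing in for the fine-tuned model:

import shutil
from pathlib import Path

import extract_msg  # assumption: extract-msg is one common package for parsing .msg files

def classify_and_file(msg_path, predict):
    """Read a .msg file, predict its class, and move it into a folder named after that class."""
    msg = extract_msg.Message(msg_path)
    text = f"{msg.subject or ''}\n{msg.body or ''}"   # subject + body as model input
    folder = Path(predict(text))                      # predict: hypothetical text -> class name
    folder.mkdir(exist_ok=True)
    return shutil.move(msg_path, str(folder / Path(msg_path).name))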

Deployment using Flask

How to run?

  • Clone the repository:

git clone https://github.com/utkarsh-21st/Email-Classification-and-Deployment.git

  • After going through the Jupyter notebook, you will have two artifacts saved: a model file and a model_data folder. Put the model file into the app/model directory and the model_data folder into the app directory.

  • Now fire up a terminal, cd to your project directory, and run the command below:

python main.py

  • Head over to the URL (localhost) displayed in your terminal window (similar to http://127.0.0.1:5000/).

[Screenshot: Home Page]


Overview of Different Approaches for Text Classification [January 2021]

The Bag of Words representation, widely used before the advent of deep learning, approaches the problem as follows. Given a dataset of text sequences, it first builds a vocabulary (the set of unique words). A numerical feature vector is then generated for every sequence: an array the size of the vocabulary, where each element counts the occurrences, within that sequence, of the corresponding vocabulary word. All the vectors and their class labels are then fed to a classification algorithm such as an SVM, an ensemble method like gradient boosting, or a neural network. The features can be improved by re-weighting them with TF-IDF (Term Frequency-Inverse Document Frequency); a minimal sketch of this pipeline follows this overview.

It turns out that word embeddings learned through neural networks perform much better than the above. A word embedding is also a numerical vector representation of a word; two famous ones are word2vec and GloVe. These are learned embeddings that take both occurrence and co-occurrence information into account. They also capture semantic similarity between words and are known to perform better than 'simple' Bag of Words representations.

RNNs (Recurrent Neural Networks) are currently among the best approaches to sequence-based problems. Their major advantage is that they maintain a state that keeps track of context while reading the sequence one token at a time. When using RNNs for this task, we can represent each word by a pre-trained embedding such as word2vec or GloVe. In particular, an LSTM (Long Short-Term Memory), a kind of RNN, can be used; better still, a bidirectional LSTM, which reads the sequence in both directions (a small sketch also follows this overview).

Next, there is BERT (Bidirectional Encoder Representations from Transformers), currently one of the best models out there. It uses the Transformer, an attention mechanism that learns contextual relations between the words in a sequence. It is a deep neural network consisting of hundreds of millions of parameters, so training it from scratch requires enormous computational resources as well as a huge dataset. So we are going to make use of a pre-trained model and fine-tune it for our specific task.
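To make the Bag of Words / TF-IDF pipeline above concrete, here is a minimal scikit-learn sketch; the toy texts and labels are invented for illustration and are not from the project's dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["refund not received yet", "card blocked without notice"]   # toy data
labels = ["billing", "account"]

# Counts of vocabulary words per text, re-weighted by TF-IDF, fed to a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["my refund is still missing"]))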
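And here is a minimal bidirectional-LSTM classifier sketch in Keras; the vocabulary size, sequence length, and number of classes are placeholder values:

import tensorflow as tf

vocab_size, seq_len, num_classes = 20000, 100, 5     # placeholder values
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),       # could be initialised with GloVe vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # reads the sequence both ways
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.build(input_shape=(None, seq_len))              # sequences padded to a fixed length
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()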

Finally, there is XLNet, which as of this writing (January 2021) is the state-of-the-art model on various NLP tasks. XLNet is an extension of the Transformer-XL model. Again, we are going to make use of a pre-trained model and fine-tune it for our specific task.
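Fine-tuning either model with the Hugging Face transformers library follows the same pattern. The sketch below shows a single training step; the checkpoint name, num_labels, and the toy batch are placeholders, not the exact code from the notebook (for multi-label targets you would additionally pass problem_type="multi_label_classification" and float labels):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Swap "distilbert-base-uncased" for "xlnet-base-cased" to fine-tune XLNet instead.
name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=5)

batch = tokenizer(["refund not received yet"], padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=torch.tensor([2])).loss   # toy gold label
loss.backward()                                        # in practice, run inside a training loop with an optimizer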
