Reddit-Flair-Detector

This model aims to detect the flair of a Reddit post from the r/india subreddit. Given the URL of a post, it classifies the flair as Coronavirus, Nonpolitical, Political, or others. The model can also be hosted as a web service.

Installation

First, clone the repo to your local machine:

git clone https://github.com/Ishan-Kumar2/Reddit-Flair-Detector.git

This project requires

The required dependencies are listed in requirements.txt. To install them:

pip install -r requirements.txt

Data Acquisition

The data can be loaded either from a CSV file or through PRAW, the Python wrapper for Reddit's API. The CSV files used for training this model are in the dataset folder as train.csv and val.csv. The steps for extracting the data and applying basic processing are in Data_Acquistion.ipynb.
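
As a rough illustration of the CSV route, the sketch below writes post records to CSV with Python's standard csv module. The column names and the helper posts_to_csv are assumptions made for this example; the actual columns of train.csv and val.csv are defined in Data_Acquistion.ipynb. In practice the records would come from PRAW submissions rather than hand-written dictionaries.

```python
import csv
import io

# Hypothetical column names; the real dataset's columns may differ.
COLUMNS = ["title", "body", "num_comments", "score", "flair"]


def posts_to_csv(posts, handle):
    """Write an iterable of post dictionaries to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for post in posts:
        # Missing fields become empty strings instead of raising an error.
        writer.writerow({name: post.get(name, "") for name in COLUMNS})
        count += 1
    return count


# Made-up records standing in for posts fetched through PRAW.
sample_posts = [
    {"title": "Lockdown extended", "body": "", "num_comments": 12,
     "score": 40, "flair": "Coronavirus"},
    {"title": "Monsoon photos", "body": "Taken last week", "num_comments": 3,
     "score": 15, "flair": "Nonpolitical"},
]

buffer = io.StringIO()
rows_written = posts_to_csv(sample_posts, buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with newline="" would write the same rows to disk.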

EDA

Exploratory data analysis can be found in EDA.ipynb. Based on this analysis, the features to use were chosen and certain flairs were removed because they had too little data. The final dataset consists of around 450 posts.

Model

Model with Title

In this example I decided to use a simple LSTM on the title. The data was preprocessed and tokenised, then converted to pretrained word embeddings. For the word embeddings I decided to go with GloVe 50d. A second model that replaces the LSTM with a Bi-LSTM was also used, as it allows context from both sides.
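
The tokenise-then-embed step can be sketched as follows. This is a dependency-free illustration, not the notebook's code: the tokeniser is a plain lowercase regex split, the three-dimensional vectors are made-up stand-ins for GloVe 50d, and mapping out-of-vocabulary words to a zero vector is an assumption. The parser relies only on the GloVe text format, where each line holds a word followed by its space-separated vector components.

```python
import re


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a {word: vector} dictionary."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def embed(tokens, table, dim):
    """Look up each token; unknown words get a zero vector (an assumption)."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]


# Toy 3-dimensional vectors; real GloVe 50d vectors have 50 components.
toy_lines = [
    "lockdown 0.1 0.2 0.3",
    "extended 0.4 0.5 0.6",
]
table = parse_glove_lines(toy_lines)
tokens = tokenize("Lockdown extended again!")
vectors = embed(tokens, table, dim=3)
```

The resulting list of vectors is the kind of sequence an LSTM or Bi-LSTM would then consume one step at a time.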

Model with Context, Title

In this example, in addition to the Bi-LSTM model for the title, I decided to also use the context (body) of the post. Since the body of a post can be as long as 14k words, a sequential model like an LSTM would be computationally very expensive. Hence I decided to use fastText, as proposed in Bag of Tricks for Efficient Text Classification.

Implementation Details

Pretrained word embeddings (GloVe 50d and fastText simple 300d)
Average of the word embeddings over the entire sentence
Averaged vector fed through a feed-forward network
Loss function: negative log likelihood
Optimizer: Adam
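
The steps listed above can be sketched in plain Python: average the word vectors, apply one linear layer, and score the result with log-softmax and negative log likelihood. The weights, the two-class setup, and the function names are invented for illustration; the actual model is trained in PyTorch with Adam, which this sketch does not reproduce.

```python
import math


def average_embedding(vectors):
    """Average a non-empty list of equal-length word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def linear(x, weights, bias):
    """Compute weights @ x + bias for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]


def nll_loss(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]


# Toy inputs: two 2-dimensional word vectors and a 2-class linear layer.
word_vectors = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 1.0], [-1.0, -1.0]]  # made-up parameters
bias = [0.0, 0.0]

sentence_vector = average_embedding(word_vectors)
log_probs = log_softmax(linear(sentence_vector, weights, bias))
loss = nll_loss(log_probs, target=0)
```

Because the averaging step ignores word order, the cost grows only linearly with the length of the post, which is what makes this approach practical for very long bodies.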

Seq2seq Model with Attention

With the intuition that certain keywords are essential for classifying a post into a particular flair, I applied an attention mechanism on top of the BiLSTM and concatenated its output with the final hidden state for classification. The reason is that, for example, in a title with a keyword like coronavirus at the start of the sentence, the final hidden state is likely to retain only a small contribution from that keyword, which could lead to misclassification.

Implementation Details

Pretrained word embeddings (GloVe 50d and fastText simple 300d)
Single-layer BiLSTM
Optimizer: Adam
Loss function: negative log likelihood
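
A minimal sketch of the idea, assuming dot-product attention: the final hidden state acts as the query, each hidden state gets a score, the softmax of those scores weights the hidden states into a context vector, and that vector is concatenated with the final hidden state. The source does not state which scoring function was used, so the dot product is an assumption, and the hidden states below are made-up numbers rather than BiLSTM outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, final_state):
    """Return attention weights and [context ; final_state] features."""
    # Dot-product score between each hidden state and the final state.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, final_state))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(final_state)
    # Context vector: attention-weighted sum of all hidden states.
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, context + list(final_state)


# Toy hidden states for a three-token title; the first token plays the
# role of an early keyword that aligns strongly with the query.
hidden_states = [[3.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
final_state = [1.0, 0.0]

weights, features = attend(hidden_states, final_state)
```

In this toy case the first position receives the largest weight, so its information reaches the classifier directly instead of only through the final hidden state.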

Seq2seq and fastText model

In this attempt I concatenated the output of the fastText model for the context with that of the BiLSTM model. In addition, I used the features number_comments (number of comments) and Score (score of the post); the reasoning is given in the EDA. These features were first passed through a feed-forward layer, and the output was concatenated with those of the title model and the context model.

Implementation Details

Pretrained word embeddings (GloVe 50d and fastText simple 300d)
Single-layer BiLSTM
Attention applied between the final hidden state and all hidden states of the LSTM
Attention between the context and the final hidden state
Concatenation of the above vectors with the output of the number-of-comments and score model
Optimizer: Adam
Loss function: negative log likelihood
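
The final concatenation step might look like the sketch below. The layer size, the ReLU activation, and all numeric values are assumptions for illustration; only the overall shape (metadata features passed through a feed-forward layer, then concatenated with the title and context vectors) comes from the description above.

```python
def dense_relu(x, weights, bias):
    """One feed-forward layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def combine(title_vec, context_vec, num_comments, score, weights, bias):
    """Concatenate title, context, and transformed metadata features."""
    meta_vec = dense_relu([float(num_comments), float(score)], weights, bias)
    return title_vec + context_vec + meta_vec


# Made-up parameters for a 2-input, 2-output metadata layer.
meta_weights = [[0.1, 0.0], [0.0, -0.01]]
meta_bias = [0.0, 0.0]

combined = combine(
    title_vec=[0.2, -0.4],   # stand-in for the title model output
    context_vec=[0.7],       # stand-in for the context model output
    num_comments=12,
    score=40,
    weights=meta_weights,
    bias=meta_bias,
)
```

The combined vector would then be passed to the final classification layer.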

Results

The loss decreases progressively with the number of epochs, and the number of correct classifications increases with epochs.

Deploying

The final model used was the seq2seq and fastText combination model. Since it was built on PyTorch and the model itself was fairly large, the cumulative size exceeded Heroku's limit.

Also, since the torchtext Fields use lambda functions, they can't be saved using pickle. I therefore also made a model without torchtext, which is the one loaded by default in the web app.

The web app can still be run locally using Flask:

cd WebApp
export FLASK_APP=app.py
flask run

Then copy and paste the URL into a browser. For the app to run, the credentials for PRAW also have to be added.
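
The pickling limitation mentioned in this section can be reproduced without torchtext. The standard pickle module stores functions by reference to their importable name, which a lambda does not have. The sketch below is a general Python illustration, not code from this repo, and the tokeniser whitespace_tokenize is a hypothetical example of a named replacement.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialise obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def whitespace_tokenize(text):
    """Named, module-level replacement for `lambda text: text.split()`."""
    return text.split()


# A lambda has no importable name, so pickling it fails.
lambda_ok = can_pickle(lambda text: text.split())

# A named function is pickled by reference, provided the module that
# defines it can be imported when the pickle is loaded.
named_ok = can_pickle(whitespace_tokenize)
```

Replacing lambdas with named functions is one common workaround; the approach taken here, a model that does not depend on torchtext at all, avoids the problem entirely.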

Automated testing

The web app can be tested automatically using the /automated_testing endpoint. To do so, add the links to the Reddit posts in file.txt, one per line. Then, with the Flask app running, open:

http://127.0.0.1:5000/automated_testing

The output will be stored in sample.json in JSON format.
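
The flow of this endpoint can be sketched without Flask or PRAW: read one link per line, predict a flair for each, and write the mapping as JSON. The predict callable, the helper name, the placeholder links, and the link-to-flair output layout are all assumptions for illustration; the actual route may structure sample.json differently.

```python
import json
import os
import tempfile


def run_automated_test(links_path, output_path, predict):
    """Predict a flair for every non-empty line and save results as JSON."""
    with open(links_path, encoding="utf-8") as handle:
        links = [line.strip() for line in handle if line.strip()]
    results = {link: predict(link) for link in links}
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
    return results


def dummy_predict(link):
    """Stand-in for the real model; always returns the same flair."""
    return "Political"


# Build a temporary file.txt with placeholder (not real) post links.
workdir = tempfile.mkdtemp()
links_file = os.path.join(workdir, "file.txt")
output_file = os.path.join(workdir, "sample.json")
with open(links_file, "w", encoding="utf-8") as handle:
    handle.write("https://www.reddit.com/r/india/comments/example1/\n\n"
                 "https://www.reddit.com/r/india/comments/example2/\n")

results = run_automated_test(links_file, output_file, dummy_predict)
```

Blank lines in file.txt are skipped, so a trailing newline does not produce an empty entry in the output.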

Future Work

[1.] Byte Pair Encoding - Since the corpus contains many out-of-vocabulary (OOV) words, such as COVID-19 and coronavirus, I decided to fine-tune the embeddings. The performance should still be compared against BPE, which is not affected by OOV words.
[2.] Transformers - Using BERT for classifying both title and model class.
[3.] ELMo - Contextual embeddings.
[4.] Text CNN - Using a Text CNN model in place of fastText for the context model.
