
Reddit-Flair-Detector

This model detects the flair of a Reddit post from the r/india subreddit. Given a post's URL, it classifies the flair as Coronavirus, Nonpolitical, Political, or Others. The model can also be hosted as a web service.

Installation

First, clone the repository to your local machine:

git clone https://github.com/Ishan-Kumar2/Reddit-Flair-Detector.git

The dependencies are listed in requirements.txt. To install them:

pip install -r requirements.txt

Data Acquisition

The data can be loaded in the form of a CSV using Reddit's API wrapper, PRAW. The CSV files for the dataset used to train this model are present in the dataset folder as train.csv and val.csv. The process of extracting the data and applying basic preprocessing is shown in Data_Acquistion.ipynb.
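Below is a minimal sketch of how such a CSV can be pulled with PRAW; the credentials, post limit, and column names are placeholders, not the exact ones used in Data_Acquistion.ipynb.

import praw
import pandas as pd

# Placeholder credentials -- register a script app on Reddit to get real ones.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="flair-detector data script",
)

rows = []
for post in reddit.subreddit("india").hot(limit=500):
    rows.append({
        "title": post.title,
        "body": post.selftext,
        "flair": post.link_flair_text,
        "score": post.score,
        "num_comments": post.num_comments,
        "url": post.url,
    })

pd.DataFrame(rows).to_csv("dataset/train.csv", index=False)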

EDA

Exploratory data analysis can be found in EDA.ipynb. Based on this analysis, the features to be used were chosen, and certain flairs were removed due to a lack of data. The final dataset consists of around 450 posts.
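For illustration, a hedged sketch of the kind of filtering this implies (the column name "flair" and the threshold of 30 are assumptions, not values from EDA.ipynb):

import pandas as pd

df = pd.read_csv("dataset/train.csv")

# Inspect how many posts each flair has.
counts = df["flair"].value_counts()
print(counts)

# Drop flairs with too few posts to learn from.
keep = counts[counts >= 30].index
df = df[df["flair"].isin(keep)]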

Model

Model with Title

For this model I used a simple LSTM on the title. The data was preprocessed and tokenised, then converted to pretrained word embeddings; for the embeddings I went with GloVe 50d. A variant replacing the LSTM with a Bi-LSTM was also tried, as it captures context from both directions.

(Figures: BiLSTM architecture and LSTM cell)
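Below is a minimal PyTorch sketch of the Bi-LSTM variant; the class name, layer sizes, and pretrained-vector handling are illustrative assumptions, not the repo's exact code.

import torch
import torch.nn as nn

class TitleBiLSTM(nn.Module):
    """Bi-LSTM over embedded title tokens; sizes are illustrative."""
    def __init__(self, vocab_size, num_classes, emb_dim=50, hidden=128,
                 pretrained=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:            # e.g. GloVe 50d vectors
            self.embedding.weight.data.copy_(pretrained)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        embedded = self.embedding(tokens)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (2, batch, hidden)
        # Concatenate the final forward and backward hidden states.
        final = torch.cat([h_n[0], h_n[1]], dim=1)
        return torch.log_softmax(self.fc(final), dim=1)  # for NLLLoss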

Model with Title and Context

In addition to the Bi-LSTM model for the title, this model also uses the context (body) of the post. Since the body of a post can be as long as 14k words, using a sequential model like an LSTM would be very expensive computationally. Hence I decided to use fastText, as proposed in Bag of Tricks for Efficient Text Classification (a sketch follows the implementation details below).

(Figure: fastText model)

Implementation Details

  • Pretrained word embeddings (GloVe 50d and fastText simple 300d)
  • Word embeddings averaged over the entire sentence
  • The average fed through a feed-forward network
  • Loss function: negative log likelihood
  • Optimizer: Adam
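A sketch of the averaging approach, under the same caveat that names and sizes are illustrative:

import torch
import torch.nn as nn

class FastTextContext(nn.Module):
    """Averages word embeddings over the post body, then applies a
    feed-forward network -- the bag-of-tricks idea."""
    def __init__(self, vocab_size, num_classes, emb_dim=300, hidden=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, tokens):                # tokens: (batch, seq_len)
        embedded = self.embedding(tokens)     # (batch, seq_len, emb_dim)
        averaged = embedded.mean(dim=1)       # average over the sentence
        return torch.log_softmax(self.ff(averaged), dim=1)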

Seq2seq Model with Attention

With the intuition that certain keywords are extremely important for classifying a post into a given flair, I applied an attention mechanism on top of the BiLSTM and concatenated its output with the final hidden state for classification. For example, if a keyword like "coronavirus" appears at the start of a title, there is a high chance its contribution to the final hidden state is small, potentially leading to a misclassification; attention lets the classifier look back at every timestep (a sketch of the attention step follows the implementation details below).

(Figure: attention mechanism)

Implementation Details

  • Pretrained word embeddings (GloVe 50d and fastText simple 300d)
  • Single-layer BiLSTM
  • Optimizer: Adam
  • Loss function: negative log likelihood
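The attention step itself can be sketched as follows; dot-product scoring is assumed here, and the repo's exact scoring function may differ.

import torch
import torch.nn.functional as F

def attend(outputs, final_hidden):
    """Attention of the final hidden state over all BiLSTM outputs.

    outputs:      (batch, seq_len, dim)  all BiLSTM hidden states
    final_hidden: (batch, dim)           concatenated last fwd/bwd state
    """
    # Score each timestep against the final hidden state.
    scores = torch.bmm(outputs, final_hidden.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)                # (batch, seq_len)
    # Weighted sum of hidden states.
    context = torch.bmm(weights.unsqueeze(1), outputs).squeeze(1)
    # Concatenate with the final hidden state for classification, so an
    # early keyword like "coronavirus" still reaches the classifier.
    return torch.cat([context, final_hidden], dim=1)  # (batch, 2*dim)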

Seq2seq and fastText model

In this attempt I concatenated the output of the fastText model for the context with that of the BiLSTM model for the title. In addition, I used the features number_comments (number of comments) and score (score of the post); the reasoning is in the EDA. These features were first passed through a feed-forward layer, and the result was concatenated with the outputs of the title and context models (a sketch follows the implementation details below).


Implementation Details

  • Pretrained word embedding (GloVe 50d and fasttext simple 300d)
  • Single Layer BiLSTM
  • Attention applied between final hidden state and all hidden state of LSTM
  • Attention between context and final hidden state
  • Concatenation of above vectors and the output of number of comments and score model output
  • Optimizer- Adam for training
  • Loss function Negative Log likelihood
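A sketch of how the three branches can be combined; the class name, dimensions, and the assumption that the sub-models return feature vectors (rather than log-probabilities) are illustrative.

import torch
import torch.nn as nn

class CombinedModel(nn.Module):
    """Concatenates the title (BiLSTM + attention), context (fastText)
    and metadata branches; dimensions are assumptions."""
    def __init__(self, title_model, context_model, num_classes,
                 title_dim=512, context_dim=64, meta_hidden=16):
        super().__init__()
        self.title_model = title_model      # assumed: (batch, title_dim)
        self.context_model = context_model  # assumed: (batch, context_dim)
        # Feed-forward layer for number of comments and score.
        self.meta = nn.Sequential(nn.Linear(2, meta_hidden), nn.ReLU())
        self.fc = nn.Linear(title_dim + context_dim + meta_hidden,
                            num_classes)

    def forward(self, title, context, num_comments, score):
        meta_in = torch.stack([num_comments, score], dim=1).float()
        combined = torch.cat([
            self.title_model(title),
            self.context_model(context),
            self.meta(meta_in),
        ], dim=1)
        return torch.log_softmax(self.fc(combined), dim=1)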

Results

The loss decreases progressively with the number of epochs, and the number of correct classifications increases accordingly. (Figure: training results)

Deploying

The final model used was the combined seq2seq and fastText model. Since it was built on PyTorch and the model itself was fairly large, the cumulative size exceeded Heroku's limit. The web app can, however, be run locally using Flask.

Also, since torchtext Fields use lambda functions, they cannot be saved with pickle; hence I also built a version of the model without torchtext, which is the one loaded by default by the web app.

cd WebApp
export FLASK_APP=app.py
flask run

Then copy and paste the URL into a browser. For the app to run, the credentials for PRAW also have to be added.

Automated testing

The web app can be tested automatically via the /automated_testing endpoint. To do this, add the links to the Reddit posts to file.txt, one per line. Then, with the Flask app running, send the file to:

http://127.0.0.1:5000/automated_testing

The output will be stored in JSON format in sample.json.
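For example, with the requests library; the multipart field name "upload_file" is an assumption, so check app.py for the key the endpoint actually reads.

import json
import requests

# Send file.txt (one post URL per line) to the running Flask app.
with open("file.txt", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:5000/automated_testing",
        files={"upload_file": f},
    )

# Persist the returned predictions.
with open("sample.json", "w") as out:
    json.dump(resp.json(), out, indent=2)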

Future Work

  1. Byte Pair Encoding: since there are many out-of-vocabulary words in the corpus (e.g. COVID-19, coronavirus), I decided to fine-tune the embeddings. The performance should still be compared against BPE, which is unaffected by OOV words.
  2. Transformers: using BERT for classifying with both the title and the context.
  3. ELMo: contextual embeddings.
  4. Text CNN: using a Text CNN in place of fastText for the context model.

References

  • Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov. Bag of Tricks for Efficient Text Classification. arXiv:1607.01759, 2016.
