A Reddit Flair Detector using Machine Learning Algorithms

radonys/Reddit-Flair-Detector

Reddit Flair Detector

A Reddit Flair Detector web application to detect flairs of India subreddit posts using Machine Learning algorithms. The application can be found live at Reddit Flair Detector.

Directory Structure

The repository is a Django web application set up for hosting on Heroku. Its files and folders are described below:

  1. manage.py - The script used to start the Django server.
  2. requirements.txt - Lists all Python dependencies of the project.
  3. nltk.txt - Lists the NLTK resources the project needs.
  4. Procfile - Needed to set up Heroku.
  5. website - Folder containing the master settings of the Django application.
  6. templates - Folder containing HTML/CSS files.
  7. flair-detector - Folder containing the main application, which loads the Machine Learning models and renders the results in the web application.
  8. data - Folder containing CSV and MongoDB instances of the collected data.
  9. Models - Folder containing the saved model.
  10. Jupyter Notebooks - Folder containing Jupyter Notebooks that collect Reddit India data and train Machine Learning models. The notebooks can be opened in Google Colaboratory.

Codebase

The entire codebase is written in Python, using its text processing and machine learning modules. The application is built on the Django web framework and hosted on Heroku.

Project Execution

  1. Open the Terminal.
  2. Clone the repository by entering git clone https://github.com/radonys/Reddit-Flair-Detector.git.
  3. Ensure that Python 3 and pip are installed on the system.
  4. Create a virtualenv by executing the following command: virtualenv -p python3 env.
  5. Activate the env virtual environment by executing the following command: source env/bin/activate.
  6. Enter the cloned repository directory and execute pip install -r requirements.txt.
  7. Open a Python shell, import nltk, execute nltk.download('stopwords'), and exit the shell.
  8. Execute python manage.py runserver. The command prints the localhost address and port the server is listening on.
  9. Open that address in a web browser and use the application.

Dependencies

The following dependencies can be found in requirements.txt:

  1. praw
  2. scikit-learn
  3. nltk
  4. Django
  5. bs4
  6. pandas
  7. numpy

Approach

After reviewing the available literature on text processing and on machine learning algorithms suited to text classification, I based my approach on [2], which describes models such as Naive Bayes, Linear SVM and Logistic Regression for text classification, with code snippets. I also tried Random Forest and Multi-Layer Perceptron (MLP) models. The test accuracies obtained for each scenario are listed in the Results section.

The approach taken for the task is as follows:

  1. Collect 100 posts from the India subreddit for each of the 12 flairs using the praw module [1].
  2. The data includes title, comments, body, url, author, score, id, time-created and number of comments.
  3. For comments, only top-level comments are included in the dataset; sub-comments are not.
  4. The title, comments and body are cleaned by removing bad symbols and stopwords using nltk.
  5. Five types of features are considered for the given task:
a) Title
b) Comments
c) Urls
d) Body
e) Combining Title, Comments and Urls as one feature.
  6. The dataset is split into 70% train and 30% test data using train-test-split of scikit-learn.
  7. The dataset is then converted into vector and TF-IDF form.
  8. Then, the following ML algorithms (using scikit-learn libraries) are applied to the dataset:
a) Naive-Bayes
b) Linear Support Vector Machine
c) Logistic Regression
d) Random Forest
e) MLP
  9. Training and testing on the dataset showed that Random Forest gave the best test accuracy, 77.97%, when trained on the combined Title + Comments + Url feature.
  10. The best model is saved and used to predict the flair of a post from its URL.
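
The cleaning, splitting, vectorizing and training steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the notebooks: the toy posts and the flair names "Politics" and "Sports" are made up, the small STOPWORDS set stands in for nltk's stopwords corpus so the sketch runs offline, and the choice of CountVectorizer, TfidfTransformer, LinearSVC and all hyperparameters is assumed rather than taken from the project.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stand-in stopword list (assumption) so the sketch runs offline;
# the project uses nltk's stopwords corpus instead.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "on", "to", "and", "for"}


def clean_text(text):
    """Lowercase, replace non-alphanumeric symbols with spaces, drop stopwords."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(word for word in text.split() if word not in STOPWORDS)


# Made-up toy posts (not the collected Reddit data): two flairs, ten posts each.
texts = [f"Election results and the parliament vote, update {i}" for i in range(10)]
texts += [f"Cricket match score and the team win, update {i}" for i in range(10)]
labels = ["Politics"] * 10 + ["Sports"] * 10

# 70% train / 30% test split, as in the approach described above.
X = [clean_text(t) for t in texts]
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)

# One scikit-learn estimator per algorithm family; hyperparameters are assumptions.
classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(30,), max_iter=500, random_state=42),
}

accuracies = {}
for name, clf in classifiers.items():
    # Word counts -> TF-IDF weights -> classifier.
    model = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", clf),
    ])
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")
```

Wrapping the vectorizer and classifier in one Pipeline means the saved model can take raw cleaned text at prediction time, which suits the final step of predicting a flair from a post URL.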

Results

Title as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.6011904762 |
| Linear SVM | 0.6220238095 |
| Logistic Regression | 0.6339285714 |
| Random Forest | 0.6160714286 |
| MLP | 0.4970238095 |

Body as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.2083333333 |
| Linear SVM | 0.2470238095 |
| Logistic Regression | 0.2619047619 |
| Random Forest | 0.2767857143 |
| MLP | 0.2113095238 |

URL as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.3005952381 |
| Linear SVM | 0.3898809524 |
| Logistic Regression | 0.3690476190 |
| Random Forest | 0.3005952381 |
| MLP | 0.3214285714 |

Comments as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.5357142857 |
| Linear SVM | 0.6190476190 |
| Logistic Regression | 0.6220238095 |
| Random Forest | 0.6011904762 |
| MLP | 0.4761904762 |

Title + Comments + URL as Feature

| Machine Learning Algorithm | Test Accuracy |
| --- | --- |
| Naive Bayes | 0.6190476190 |
| Linear SVM | 0.7529761905 |
| Logistic Regression | 0.7470238095 |
| Random Forest | 0.7797619048 |
| MLP | 0.4940476190 |
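
To compare the five tables at a glance, the short sketch below copies the reported accuracies into a dictionary and picks the top-scoring algorithm for each feature set. The `results` layout and the `best_per_feature` helper are illustrative names, not part of the repository.

```python
# Test accuracies copied from the tables above:
# feature set -> algorithm -> test accuracy.
results = {
    "Title": {
        "Naive Bayes": 0.6011904762, "Linear SVM": 0.6220238095,
        "Logistic Regression": 0.6339285714, "Random Forest": 0.6160714286,
        "MLP": 0.4970238095,
    },
    "Body": {
        "Naive Bayes": 0.2083333333, "Linear SVM": 0.2470238095,
        "Logistic Regression": 0.2619047619, "Random Forest": 0.2767857143,
        "MLP": 0.2113095238,
    },
    "URL": {
        "Naive Bayes": 0.3005952381, "Linear SVM": 0.3898809524,
        "Logistic Regression": 0.3690476190, "Random Forest": 0.3005952381,
        "MLP": 0.3214285714,
    },
    "Comments": {
        "Naive Bayes": 0.5357142857, "Linear SVM": 0.6190476190,
        "Logistic Regression": 0.6220238095, "Random Forest": 0.6011904762,
        "MLP": 0.4761904762,
    },
    "Title + Comments + URL": {
        "Naive Bayes": 0.6190476190, "Linear SVM": 0.7529761905,
        "Logistic Regression": 0.7470238095, "Random Forest": 0.7797619048,
        "MLP": 0.4940476190,
    },
}


def best_per_feature(results):
    """Return {feature: (algorithm, accuracy)} for the top scorer in each table."""
    return {
        feature: max(scores.items(), key=lambda item: item[1])
        for feature, scores in results.items()
    }


for feature, (algorithm, accuracy) in best_per_feature(results).items():
    print(f"{feature}: {algorithm} ({accuracy:.2%})")
```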

Intuition behind Combined Feature

On their own, the title and comments features each reached a test accuracy of roughly 60%, the URL feature peaked at about 39%, and the body feature gave the worst accuracies in the experiments (at most about 28%). Hence, body was excluded from the combined feature set.
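
One simple way to build such a combined feature is to join the three text fields into a single string before vectorizing. The sketch below assumes each post is a dictionary with hypothetical keys `title`, `comments`, `url` and `body`; how the notebooks actually combine the fields is not stated here, so treat this as one plausible reading.

```python
def combine_features(post):
    """Join title, comments and URL into one text feature; body is left out."""
    parts = [post.get("title", ""), post.get("comments", ""), post.get("url", "")]
    # Skip empty fields so missing values do not add stray spaces.
    return " ".join(part for part in parts if part)


# Made-up example post; the field names are assumptions.
post = {
    "title": "Monsoon arrives early",
    "comments": "Great news for farmers",
    "url": "https://example.com/monsoon",
    "body": "This text is ignored by the combined feature",
}

print(combine_features(post))
```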

References

  1. How to scrape data from Reddit
  2. Multi-Class Text Classification Model Comparison and Selection