manutej/reddit_scraping_classification

Reddit Web Scraping and Subreddit ML Classification

Description

The goal of this project was to build a classification model on natural-language data from Reddit, a publicly available forum. Posts were first scraped from Reddit through the Pushshift API, then explored through Exploratory Data Analysis (EDA), and finally used to build classification models for the chosen subreddits.
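The scraping stage can be sketched as a paginated query loop. This is a minimal sketch, not the code from the notebooks: the endpoint URL and the `subreddit`, `size`, and `before` parameters are assumptions based on Pushshift's historical public API, and `build_query`, `collect_posts`, and the `fetch_page` callable are hypothetical names introduced for illustration. `fetch_page` is expected to perform the HTTP request and return the list of submission dicts, each carrying a `created_utc` timestamp.

```python
from urllib.parse import urlencode

# Assumption: Pushshift's historical submission-search endpoint.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"


def build_query(subreddit, before=None, size=100):
    """Build a search URL for one page of submissions from a subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # Only return posts created before this epoch timestamp.
        params["before"] = before
    return f"{PUSHSHIFT_URL}?{urlencode(params)}"


def collect_posts(fetch_page, subreddit, max_posts=500, size=100):
    """Page backwards in time until max_posts are collected or no posts remain.

    fetch_page(url) is a hypothetical callable that returns a list of
    submission dicts, each with a 'created_utc' field.
    """
    posts, before = [], None
    while len(posts) < max_posts:
        page = fetch_page(build_query(subreddit, before, size))
        if not page:
            break
        posts.extend(page)
        # The oldest timestamp on this page becomes the next upper bound.
        before = min(p["created_utc"] for p in page)
    return posts[:max_posts]
```

Passing the fetcher in as a callable keeps the paging logic separate from the network call, so the loop can be exercised with an in-memory stand-in before pointing it at a live service.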

File Structure

The repository contains two folders: data and images. The data folder holds the output and any intermediate CSV files produced between the stages of the process. The images folder holds the figures produced during the Exploratory Data Analysis. The three main notebooks are:

  1. reddit data scraping
  2. reddit eda (exploratory data analysis)
  3. reddit classification modeling

Conclusions

Machine learning classification is a valid way to distinguish text corpora using NLP. For our dataset and the binary classification models we tried, logistic regression was the better model for the task at hand: targeted messaging for CS hiring candidates.
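To make the idea of binary text classification with logistic regression concrete, here is a dependency-free sketch. It is not the modeling notebook's code: the README does not state which library, features, or subreddits were used, so the bag-of-words features, the plain gradient-descent training loop, the function names, and the toy posts below are all assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(texts, labels, epochs=200, lr=0.5):
    """Fit bag-of-words logistic regression with batch gradient descent."""
    counts = [Counter(tokenize(t)) for t in texts]
    weights = {tok: 0.0 for feats in counts for tok in feats}
    bias = 0.0
    n = len(texts)
    for _ in range(epochs):
        grad_w, grad_b = Counter(), 0.0
        for feats, y in zip(counts, labels):
            z = bias + sum(weights[t] * c for t, c in feats.items())
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            grad_b += err
            for t, c in feats.items():
                grad_w[t] += err * c
        bias -= lr * grad_b / n
        for t, g in grad_w.items():
            weights[t] -= lr * g / n
    return weights, bias


def predict_proba(text, weights, bias):
    """Probability that the text belongs to class 1; unseen words are ignored."""
    feats = Counter(tokenize(text))
    return sigmoid(bias + sum(weights.get(t, 0.0) * c for t, c in feats.items()))


# Toy, made-up posts: label 1 = hypothetical subreddit A, 0 = hypothetical subreddit B.
texts = [
    "resume interview offer salary",
    "interview coding offer recruiter",
    "python pandas bug error",
    "error traceback python install",
]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_regression(texts, labels)
```

Calling `predict_proba("salary offer interview", weights, bias)` returns a probability above 0.5 for this toy data, and `predict_proba("python error", weights, bias)` returns one below 0.5. The learned per-word weights are also directly inspectable, which is one practical reason logistic regression suits a task like targeted messaging.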
