INFO6101-Spam-Classifier

Implementing a Spam Classifier using Naive Bayes and Spark

The datasets for this project can be downloaded from the links below:

1998 dataset - https://drive.google.com/open?id=1QtoxpJmd1lys7c7LaYXiOjbzMdMOpeVX
trec07p dataset - https://drive.google.com/open?id=1xaJL1eoccrCyS45xgF23dVY_KCER-oAD

The project is divided into three notebooks:

Part-1 Structuring data: This part reads the text files containing the emails, parses the email fields (To, From, Subject, Body, etc.) into a pandas dataframe, and saves the result as "structured.xlsx".
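As a rough illustration of the parsing step, the sketch below uses Python's standard `email` module to turn one raw message into a flat record. The helper names `parse_email` and `extract_body` and the sample message are hypothetical and are not taken from the notebook, which may parse the files differently.

```python
from email import message_from_string
from email.message import Message


def extract_body(msg: Message) -> str:
    """Return the text parts of a message joined together (payloads are not decoded)."""
    if msg.is_multipart():
        parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_maintype() == "text"
        ]
        return "\n".join(parts)
    return msg.get_payload()


def parse_email(raw: str) -> dict:
    """Parse one raw email into a record with To, From, Subject and Body fields."""
    msg = message_from_string(raw)
    return {
        "To": msg.get("To", ""),
        "From": msg.get("From", ""),
        "Subject": msg.get("Subject", ""),
        "Body": extract_body(msg).strip(),
    }


# Hypothetical sample message, not taken from either dataset.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting tomorrow\n"
    "\n"
    "Hi Bob, are we still on for 10am?\n"
)

record = parse_email(SAMPLE)
print(record)
```

A list of such records could then be passed to `pandas.DataFrame(records)` and written out with `DataFrame.to_excel("structured.xlsx")`.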

Part-2 Exploratory data analysis: This part reads "structured.xlsx" into a dataframe and explores the data to identify candidate features and gain insight into the dataset.
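One kind of exploration that fits this step is comparing class balance and word frequencies between spam and non-spam mail. The sketch below does this with the standard library only. The rows and the column names "Subject" and "Label" are assumptions made for illustration; the actual columns in "structured.xlsx" may differ.

```python
import re
from collections import Counter

# Hypothetical rows standing in for the structured data.
rows = [
    {"Subject": "Win a free prize now", "Label": "spam"},
    {"Subject": "Free offer just for you", "Label": "spam"},
    {"Subject": "Project meeting notes", "Label": "ham"},
    {"Subject": "Notes from the free software meetup", "Label": "ham"},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Number of emails per class.
class_counts = Counter(row["Label"] for row in rows)

# Token frequencies per class.
word_counts = {}
for row in rows:
    word_counts.setdefault(row["Label"], Counter()).update(tokenize(row["Subject"]))

print(class_counts)
for label, counts in word_counts.items():
    print(label, counts.most_common(3))
```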

Part-3 Feature Extraction, Prediction: This part extracts features from the dataset, trains and tests a Multinomial Naive Bayes model, and calculates the model's accuracy.
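To make the train, test, and accuracy steps concrete, here is a minimal Multinomial Naive Bayes sketch in plain Python with add-one (Laplace) smoothing. It does not use Spark and is not the notebook's code. The function names, the smoothing parameter `alpha`, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from (tokens, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    for tokens, label in docs:
        token_counts[label].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}

    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        # Denominator: tokens seen in class c plus alpha per vocabulary entry.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(tokens, model):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy data, not taken from either dataset.
train_docs = [
    ("win free prize now".split(), "spam"),
    ("free offer claim prize".split(), "spam"),
    ("project meeting notes".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
test_docs = [
    ("claim free prize".split(), "spam"),
    ("meeting notes tomorrow".split(), "ham"),
]

model = train(train_docs)
predictions = [predict(tokens, model) for tokens, _ in test_docs]
correct = sum(p == label for p, (_, label) in zip(predictions, test_docs))
accuracy = correct / len(test_docs)
print(predictions, accuracy)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters for long email bodies.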

Packages/modules needed to run this project: nltk, wordcloud, xlrd, numpy, pandas, spark, xlsxwriter, BeautifulSoup, matplotlib, findspark.
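A quick way to see which of these are missing before opening the notebooks is to probe for them with `importlib`. The import names below are assumptions: the sketch assumes "spark" is importable as `pyspark` and "BeautifulSoup" as `bs4`, which may not match how the notebooks import them.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above.
REQUIRED_MODULES = [
    "nltk", "wordcloud", "xlrd", "numpy", "pandas",
    "pyspark", "xlsxwriter", "bs4", "matplotlib", "findspark",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```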

A pre-built Spark 2.3.1 can be downloaded from here - https://drive.google.com/open?id=1iFVW0RxqL1VNrIOrfJXrDkPHF8ewXH49

To install Spark, untar the downloaded file and set the SPARK_HOME environment variable to the path of the spark-2.3.1-bin-hadoop2.7 folder.
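If setting the variable system-wide is inconvenient, it can also be set from Python before Spark is loaded. The sketch below is one way to do that. The helper `set_spark_home` and the location under the home directory are hypothetical; adjust the path to wherever the archive was extracted.

```python
import os
from pathlib import Path


def set_spark_home(install_dir):
    """Point SPARK_HOME at an extracted Spark folder and return the resolved path."""
    spark_home = Path(install_dir).expanduser().resolve()
    if not spark_home.is_dir():
        raise FileNotFoundError(f"Spark folder not found: {spark_home}")
    os.environ["SPARK_HOME"] = str(spark_home)
    return spark_home


# Hypothetical extraction location; change it to match your setup.
candidate = Path.home() / "spark-2.3.1-bin-hadoop2.7"
if candidate.is_dir():
    print("SPARK_HOME set to", set_spark_home(candidate))
else:
    print("Spark folder not found at", candidate)
```

With SPARK_HOME set, calling `findspark.init()` at the top of a notebook should be able to locate the installation.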

To run the project:

1. Download and untar the datasets into the same directory as the Python notebooks.
2. Make sure all the packages listed above are installed.
3. Run Part_1_Structuring_data.ipynb. It saves an Excel file, "structured.xlsx", in the same directory.
4. Run Part_2_Exploratory_data_analysis.ipynb.
5. Run Part_3_Feature_Extraction,_Prediction.ipynb.


kunalchugh91/INFO6101-Spam-Classifier
