This project documents the steps I took to build a spam email classifier that labels an email as spam or non-spam (ham) based on a set of its features.
I obtained the dataset for this project from Kaggle: https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset?resource=download. Specifically, I used the 'completeSpamAssassin.csv' file. It contains a serial number column that can be used as the index, a body column that holds the text content of each email, and a label column that is 0 for ham emails and 1 for spam emails, as seen in the figure below.
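As a rough illustration of the layout described above, here is a minimal loading sketch with pandas. The column names `Body` and `Label`, the helper `load_emails`, and the three-row inline sample are assumptions made for this example; in the project itself the path to 'completeSpamAssassin.csv' would be passed instead. Dropping rows with an empty body is likewise an assumption, not a step described above.

```python
import io

import pandas as pd

# Tiny stand-in for 'completeSpamAssassin.csv' so the sketch runs on its own.
# Assumed layout: unnamed serial number column, then Body, then Label.
SAMPLE_CSV = """,Body,Label
0,"Hi team, the meeting is moved to 3pm.",0
1,"WIN a FREE prize now!!! Click here",1
2,,0
"""


def load_emails(source):
    """Load the emails, using the serial number column as the index."""
    df = pd.read_csv(source, index_col=0)
    # Assumption: rows without any body text are not useful and are dropped.
    df = df.dropna(subset=["Body"])
    # Labels are 0 for ham and 1 for spam.
    df["Label"] = df["Label"].astype(int)
    return df


emails = load_emails(io.StringIO(SAMPLE_CSV))
print(emails["Label"].value_counts())
```

With the real file, the call would look like `load_emails("completeSpamAssassin.csv")`, assuming the file sits in the working directory.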
Given this dataset, the straightforward approach is to use a vectorizer (count or TF-IDF) to extract features from the email body column. For this project, however, I also applied some feature engineering to produce additional features (on top of those extracted by a TF-IDF vectorizer) in an attempt to improve the performance of the spam email classifier. I also implemented several different machine learning algorithms to identify the best-performing one.
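The idea of combining TF-IDF features with engineered ones can be sketched as follows. This is only an illustration under stated assumptions: the four hand-crafted features (length, exclamation marks, uppercase ratio, link count), the helper `handcrafted_features`, the sample emails, and the choice of `LogisticRegression` are all placeholders for this example, not the features or models actually used in the project.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def handcrafted_features(bodies):
    """Hypothetical engineered features, one row per email body."""
    rows = []
    for text in bodies:
        n_chars = len(text)
        n_upper = sum(ch.isupper() for ch in text)
        rows.append([
            n_chars,                                # body length
            text.count("!"),                        # exclamation marks
            n_upper / n_chars if n_chars else 0.0,  # share of uppercase chars
            text.lower().count("http"),             # rough link count
        ])
    return np.array(rows, dtype=float)


# Made-up sample data, for illustration only (0 = ham, 1 = spam).
bodies = [
    "Hi team, the meeting is moved to 3pm tomorrow.",
    "WIN a FREE prize now!!! Click http://example.com",
    "Please find the quarterly report attached.",
    "CHEAP loans!!! Apply today at http://example.com/loans",
]
labels = [0, 1, 0, 1]

# Text features from the email body.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(bodies)

# Engineered features, stacked column-wise next to the TF-IDF matrix.
X_extra = csr_matrix(handcrafted_features(bodies))
X = hstack([X_text, X_extra]).tocsr()

# Any classifier that accepts sparse input could be swapped in here.
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)
```

Keeping the engineered features in a separate matrix and stacking them with `hstack` makes it easy to train the same model with and without them and compare the results.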
Topics covered in this project are:
- Supervised Machine Learning
- Binary Classification
- Natural Language Processing
- Data Visualisation
- Exploratory Data Analysis & Manipulation
- Feature Engineering & Extraction
- Model Evaluation