Spam Classification

This repository contains the code and resources for a spam classification capstone project that includes both a machine learning and a deep learning solution. The project aims to compare the performance of these two techniques for accurately classifying email messages as spam or not spam. It also contains a guide to deploy the best-performing model as a static web app on AWS to provide a user-friendly interface for email spam classification.

Introduction
Summary of Project Features
Getting Started
- Prerequisites
- Installation
Usage
Dataset
- Description
- Preprocessing
Feature Building
- Statistical Vectorization
- Distributed Representation Vectorization
Results
Deployment
- Input/Output Stream
- AWS Tools Used
Acknowledgments
Contact Information

Introduction

Spam messages are a common issue in communication platforms. This project explores both traditional machine learning and modern deep learning approaches to tackle the spam classification problem. By comparing the performance of these two different techniques, I aim to identify the most effective solution for accurate spam detection.

Summary of Project Features

Data Collection: Collect and curate a labeled dataset of email messages for training and evaluation.
Text Preprocessing: Automate text cleaning and tokenization to prepare raw text for analysis.
Feature Extraction: Extract meaningful features from text using TF-IDF (Term Frequency-Inverse Document Frequency) for the ML model.
Word Embedding: Convert text into dense vector representations suitable for NLP tasks using pretrained FastText word embeddings for the DL model.
Text Classification: Implement spam classification models (one ML and one DL) to categorize the text in email messages as spam or not spam.
- ML model: Random Forest Classifier (RF)
- DL model: Convolutional Neural Network (CNN)

Getting Started

Prerequisites

To run this project, you'll need the following prerequisites:

Python 3.10
Keras 2.13 (with TensorFlow backend)
scikit-learn
NumPy
Pandas
Jupyter Notebook (optional for data exploration)

Installation

Clone this repository to your local machine:

git clone https://github.com/anastasiaarsky/email-spam-classification.git

Navigate to the project directory:
```
cd email-spam-classification
```
Install the required Python libraries:
```
pip install -r requirements.txt
```

Usage

Data Collection and Preprocessing

The two datasets used (Spam Assassin & Enron Spam) are already located in the data/external_data/ directory.
Collect and preprocess the data by running the preprocessing script:
```
python -m src.data.make_dataset
```
- Both the raw and preprocessed data will be saved as zipped CSV files (named raw_data.zip and processed_data.zip) in the data/ directory.

ML Model Training and Evaluation

Train the Random Forest model (with TF-IDF feature extraction):
```
python -m src.models.rf_model --train
```
- This will perform TF-IDF feature extraction on the training and validation sets and use the training set to train the model.
- The model will be saved in the models/ directory as random_forest_model.joblib.
- Evaluation metrics and a confusion matrix will be printed for the validation set.
Run predictions once the Random Forest model has been trained and saved:
```
python -m src.models.rf_model
```
- This will perform TF-IDF feature extraction on the testing set and use the trained RF model that was saved in the models/ directory for predictions.
- Evaluation metrics and a confusion matrix will be printed for the testing set.

DL Model Training and Evaluation

Download the FastText pretrained word embeddings (wiki-news-300d-1M-subword.vec.zip) from the FastText website and place them in the data/ directory.
Train the CNN model (using the FastText pretrained word embeddings):
```
python -m src.models.cnn_model --train
```
- This will first vectorize the training and validation sets and use them to create an embedding layer for the CNN.
  - The tokenizer used to vectorize the training and validation sets will be saved as tokenizer.pickle in the models/ directory.
- Then the CNN will be trained and saved as cnn_model.h5 in the models/ directory.
Run predictions using the trained CNN model:
```
python -m src.models.cnn_model
```
- This will vectorize the testing set and use the trained CNN model that was saved in the models/ directory for predictions.
- Evaluation metrics and a confusion matrix will be printed for the testing set.
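
The vectorization step in the CNN commands above turns each email into a fixed-length sequence of integer word IDs. The following is a minimal, dependency-free sketch of that idea, not the project's code: the function names are hypothetical, and the choice to pad on the right is an assumption (Keras-style tokenizers and padding utilities have their own defaults).

```python
from collections import Counter


def fit_word_index(texts: list[str]) -> dict[str, int]:
    """Map each word to an integer ID, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_padded_sequence(text: str, word_index: dict[str, int], max_len: int) -> list[int]:
    """Encode text as IDs, skip unknown words, then truncate or right-pad with 0."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


word_index = fit_word_index(["win money now", "money now", "now"])
sequence = to_padded_sequence("win big money", word_index, max_len=4)
```

The word index must be fitted once on the training data and reused unchanged at prediction time, which is why the project saves its tokenizer next to the model.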

Dataset

Description

The dataset includes 39,763 entries, with 19,068 labeled as spam and 20,695 as ham (i.e., legitimate).

It is made up of two publicly available datasets (located in data/external_data/):

SpamAssassin dataset
Enron Spam dataset curated by Marcel Wiechmann

Preprocessing

To prepare the dataset for model training, I applied these preprocessing steps to the email text:

Text was transformed to lowercase.
URLs, email addresses, and numeric values were replaced with 'url', 'email', and 'number', respectively.
Non-ASCII characters were removed or decoded.
Most punctuation was removed, retaining only essential marks like '$', '!', '.', and '?'.
Common stopwords were removed.
Extra newlines and whitespace were cleaned up.

These steps ensured clean and standardized text data, which is crucial for NLP tasks like spam detection.
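
The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the code in src/data/: the regular expressions, the tiny stopword set, and the function name `preprocess` are all hypothetical, and non-ASCII characters are simply dropped here instead of decoded.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration; the project's list may differ.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "at", "on"}

# Keep the marks named above as essential; every other ASCII punctuation mark is dropped.
KEEP = "$!.?"
DROP_PUNCT = "".join(ch for ch in string.punctuation if ch not in KEEP)
PUNCT_TABLE = str.maketrans(DROP_PUNCT, " " * len(DROP_PUNCT))

# Assumed patterns; real-world URLs and addresses are messier than these.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def preprocess(text: str) -> str:
    text = text.lower()
    # Replace URLs first so that addresses inside them are not matched separately.
    text = URL_RE.sub(" url ", text)
    text = EMAIL_RE.sub(" email ", text)
    text = NUMBER_RE.sub(" number ", text)
    # Drop any character that cannot be represented in ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.translate(PUNCT_TABLE)
    # split() with no arguments also collapses extra newlines and whitespace.
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess("Visit http://spam.example/win NOW, and email bob@example.com for 100 prizes!")
```

The order matters: placeholders are substituted before punctuation is stripped, because the URL and email patterns rely on characters such as ':' and '@' that the punctuation step would remove.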

You can find the raw and preprocessed datasets in the data/ directory. The code for data collection and preprocessing is located in the src/data/ directory.

Feature Building

After applying data preprocessing techniques, I employed feature engineering methods to prepare the email text data for spam classification.

Statistical Vectorization

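For my traditional machine learning model, I used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text data into numerical feature vectors. TF-IDF weights each word by how often it appears in a document, discounted by how many documents in the entire dataset contain it, so words that are frequent in one email but rare overall receive the highest scores.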

The code for building the TF-IDF features can be found in the src/features/ directory, under build_tfidf_features.py.
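
To make the weighting concrete, here is a from-scratch sketch of the textbook TF-IDF formula. It is not the code in build_tfidf_features.py; the function name `tfidf` is hypothetical, and library implementations such as scikit-learn's `TfidfVectorizer` by default add IDF smoothing and L2 normalization, so their numbers will differ from these.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency times inverse document frequency (unsmoothed).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tfidf([["free", "money", "now"], ["meeting", "now"]])
```

In this toy corpus "now" appears in every document, so its weight is zero, while "free" and "meeting" each appear in only one document and receive positive weights.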

Distributed Representation Vectorization

For my deep learning model, I leveraged FastText word embeddings to represent words in a dense vector space, capturing semantic relationships between words.

The code for building the FastText features can be found in the src/features/ directory, under build_fasttext_features.py.
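
As a rough illustration of how pretrained vectors can feed an embedding layer, the sketch below reads the text-based .vec format (a header line with the vector count and dimension, then one word and its values per line) and builds a matrix whose row i holds the vector for word ID i. The function names and the toy three-dimensional data are hypothetical; the project's build_fasttext_features.py may work differently.

```python
import io


def load_vec(lines, wanted: set[str]) -> dict[str, list[float]]:
    """Read vectors for the wanted words from an iterable of .vec-format lines."""
    it = iter(lines)
    # Header line: "<number of vectors> <dimension>".
    _count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Only keep words in the vocabulary to limit memory use.
        if word in wanted and len(values) == dim:
            vectors[word] = [float(v) for v in values]
    return vectors


def build_embedding_matrix(word_index: dict[str, int],
                           vectors: dict[str, list[float]],
                           dim: int) -> list[list[float]]:
    """Row 0 (padding) and words without a pretrained vector stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix


# Toy stand-in for an opened .vec file; real vectors have 300 dimensions.
sample = io.StringIO("2 3\nmoney 0.1 0.2 0.3\nnow 0.4 0.5 0.6\n")
word_index = {"money": 1, "win": 2}
vectors = load_vec(sample, wanted=set(word_index))
matrix = build_embedding_matrix(word_index, vectors, dim=3)
```

With the real file, `lines` would be the unzipped .vec file opened in text mode with UTF-8 encoding, and the resulting matrix would typically be passed to the embedding layer as its initial weights.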

Results

My DL model (a CNN using FastText word embeddings) took about three times as much CPU time to train as the simpler ML model (Random Forest with TF-IDF): roughly three and a half minutes versus just over one minute. I considered this cost acceptable because the DL model achieved a higher accuracy (98.70% vs 98.24%), as well as a higher recall, precision, and F1 score.

Therefore, I decided to go ahead with my CNN model as my final model for deployment.

Below is a comparison of the two models:

| Model | Feature Extraction Method | Training Time (CPU) | Accuracy | F1 Score | Recall | Precision |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest | TF-IDF | 1min 8s | 98.24% | 98.24% | 98.24% | 98.24% |
| CNN | FastText Word Embeddings | 3min 30s | 98.70% | 98.70% | 98.72% | 98.69% |

A more detailed report on the model selection process and results can be found in the reports/ directory.
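
For readers unfamiliar with the metrics in the comparison above, this sketch shows how accuracy, precision, recall, and F1 follow from the counts in a confusion matrix. It treats spam as the positive class, which is an assumption; the reported figures may have been computed with a different averaging scheme. The function name and toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    # The four cells of the confusion matrix.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = binary_metrics(
    ["spam", "spam", "ham", "ham", "spam"],
    ["spam", "ham", "ham", "spam", "spam"],
)
```

Precision penalizes legitimate email flagged as spam, while recall penalizes spam that slips through, so reporting both gives a fuller picture than accuracy alone.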

Deployment

I have deployed this project as a static web application on AWS. Users input email text and receive a classification of spam or ham (i.e., legitimate) from my DL model.

Input/Output Stream

AWS Tools Used

Amazon S3: Hosts the web application.
Amazon API Gateway: Provides a serverless API to handle user requests and interact with the deployed DL model.
Amazon SageMaker: Deploys the DL model and makes it accessible via the API.
AWS Lambda: Connects the Amazon SageMaker endpoint to the front-end web app and performs the necessary input/output processing.
AWS X-Ray and Amazon CloudWatch: Monitor and debug the application by allowing me to analyze application behavior, identify issues, and optimize performance.

By leveraging these AWS services, I have created a robust and scalable deployment that enables users to interact with my spam classification model through a user-friendly web interface.
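
To illustrate the Lambda role described above, here is a hedged sketch of a handler that forwards email text from an API Gateway proxy request to a SageMaker endpoint. It is not the deployed code: the request and response payload shapes (`text`, `spam_probability`), the 0.5 threshold, the endpoint name, and the `make_handler` helper are all assumptions. The endpoint call is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
import json


def make_handler(invoke_endpoint, endpoint_name: str):
    """Build a Lambda handler around a SageMaker runtime invoke_endpoint callable."""

    def handler(event, context):
        # With API Gateway proxy integration, the request body arrives as a JSON string.
        text = json.loads(event["body"])["text"]
        response = invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"text": text}),  # hypothetical payload shape
        )
        # The response body is a stream; the key name below is an assumption.
        score = float(json.loads(response["Body"].read())["spam_probability"])
        label = "spam" if score >= 0.5 else "ham"
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"label": label, "score": score}),
        }

    return handler
```

Inside Lambda, the callable would typically be `boto3.client("sagemaker-runtime").invoke_endpoint`, with the endpoint name read from an environment variable.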

For more information on model deployment architecture and step-by-step instructions, go to the deployment/ directory.

Acknowledgments

I'd like to express my appreciation to the following:

SpamAssassin Dataset and Enron Spam Dataset Contributors: I appreciate the SpamAssassin project and Marcel Wiechmann (who created a CSV version of the Enron Spam dataset) for providing the spam data that was instrumental in training and evaluating my models.
FastText Word Embeddings: My gratitude to the FastText team for their word embeddings, which improved my DL model's natural language processing components.
Deployment Resources: I'd like to thank the authors of the AWS Machine Learning Blog, as well as Austin Lasseter for his Medium article, as both of these resources guided me in deploying my model with Amazon SageMaker and Lambda.

I also thank the open-source community and library authors for their valuable contributions to this project.

Contact Information

For questions or collaborations, feel free to contact me:

Email: anarsky@gmail.com
GitHub: anastasiaarsky
