ML Capstone

Description

Spam consists of unwanted, unsolicited messages sent electronically. These messages often have malicious intent, ranging from misleading advertising to phishing and malware distribution. Spam therefore harms both users and services, and breeds mistrust and wariness between the two parties.

Furthermore, spam is rapidly on the rise, with the Federal Trade Commission reporting $8.8 billion in total reported losses in 2022, compared to $6.1 billion in 2021 and a mere $1.2 billion in 2020 (FTC.gov). It is therefore increasingly important for companies and services to detect and filter spam messages.

My capstone project aims to create an ML model that is capable of detecting spam email messages.

My full project proposal can be found at: ProjectProposal.pdf.

Data

FullData.csv.zip includes 39,763 entries, with 20,695 labeled as ham and 19,068 as spam.
The dataset contains the following columns:

Column	Description
Subject	The subject line of the email
Message	The content of the email. This is an empty string if the message had only a subject line and no body. For forwarded emails or replies, it also contains the original message with its subject line, "from:", "to:", etc.
Label	Whether the email was spam (1) or not (0)
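As a rough illustration of the schema above, the following sketch reads rows with Python's standard csv module and counts the labels. The sample rows are made up for the example, and the helper name load_rows is hypothetical; it is not taken from DataCollection.py.

```python
import csv
import io
from collections import Counter

# Column names as listed in the table above.
EXPECTED_COLUMNS = {"Subject", "Message", "Label"}

def load_rows(csv_file):
    """Read rows from an open CSV file object and check the expected columns."""
    reader = csv.DictReader(csv_file)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        rows.append({
            "Subject": row["Subject"],
            "Message": row["Message"],   # may be an empty string
            "Label": int(row["Label"]),  # 1 = spam, 0 = ham
        })
    return rows

# Tiny made-up sample standing in for the real file.
sample_csv = io.StringIO(
    "Subject,Message,Label\n"
    "Meeting notes,See attached agenda.,0\n"
    "You won!!!,Click http://example.com to claim,1\n"
    "Subject only,,0\n"
)
sample_rows = load_rows(sample_csv)
label_counts = Counter(row["Label"] for row in sample_rows)
```

For the real data, the same function could be pointed at a file opened from the unzipped CSV; very long message bodies may require raising csv.field_size_limit.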

This dataset is made up of two premade datasets: the SpamAssassin dataset and the Enron Spam dataset.

Further details about these two datasets can be found at: DataCollection.md.

DataCollection.py contains the script for processing the SpamAssassin dataset, as well as merging it with the Enron Spam dataset before outputting it as Data.csv.

Model Benchmarking

I created a Custom Classifier Model in Amazon Comprehend in order to benchmark my model. After training, the model had an accuracy score of 0.99+ as well as an F1 score of 0.99+.

Further details can be found at: ComprehendResults.md.
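For reference, accuracy and F1 can be computed from true and predicted labels as in the sketch below. This is a plain-Python illustration that treats spam (1) as the positive class; Amazon Comprehend may aggregate its reported F1 differently (for example, by averaging across classes), so the numbers need not match its console output exactly. The function name and sample labels are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the spam class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Made-up labels: 1 = spam, 0 = ham.
demo_true = [1, 1, 1, 0, 0, 0, 0, 1]
demo_pred = [1, 1, 0, 0, 0, 0, 1, 1]
demo_accuracy, demo_f1 = accuracy_and_f1(demo_true, demo_pred)
```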

Data Wrangling & Exploration

Data Wrangling:

I first concatenated the Subject and Message columns of my data into a single column.
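A minimal sketch of this step is shown below. The single-space separator and the handling of empty fields are assumptions; the notebook may join the columns differently.

```python
def combine_fields(subject, message):
    """Join the subject and body into one text, skipping empty parts."""
    parts = [part.strip() for part in (subject, message) if part and part.strip()]
    return " ".join(parts)

combined_full = combine_fields("Quarterly report", "Please see the attached file.")
combined_subject_only = combine_fields("Subject only", "")
```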

Next, my text preprocessing/normalization process was as follows:

Transform each token to lower case
Replace URLs with the string 'URL'
Replace email addresses with the string 'email'
Replace numbers with the string 'number'
Remove any extra newlines or whitespace
Remove stopwords
Remove non-ASCII characters
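The steps above could be sketched roughly as follows. The regular expressions, the whitespace tokenization, and the small stopword set are all assumptions made for illustration; the actual notebook may use different patterns or a library stopword list.

```python
import re

# Illustrative patterns only; real-world URLs and addresses vary more than this.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

# Hypothetical, deliberately tiny stopword set.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "for", "at"}

def preprocess(text):
    """Normalize one email text; punctuation is intentionally kept."""
    text = text.lower()
    text = URL_RE.sub(" URL ", text)        # URLs first, so addresses inside them are not split
    text = EMAIL_RE.sub(" email ", text)
    text = NUMBER_RE.sub(" number ", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII characters
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]  # split() also collapses whitespace
    return " ".join(tokens)

cleaned = preprocess(
    "Visit http://example.com NOW!  Write bob@example.com to win 1000 dollars at the café.\n\n"
)
# cleaned == "visit URL now! write email win number dollars caf."
```

Replacing URLs before email addresses and numbers avoids rewriting digits or "@" fragments that appear inside a URL.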

I chose not to remove punctuation, as I believed it to be important for detecting spam emails.

The zipped version of my preprocessed data can be found here: CleanData.csv.zip.

Data exploration:

The key takeaways were as follows:

My dataset was fairly balanced: 52% ham to 48% spam.
The most common tokens were numbers, punctuation, and email addresses, as well as the words 'enron', 'ect', 'company', and 'subject'.
a. Ham emails commonly featured email addresses and the words 'enron', 'subject', 'ect', and 'energy'.
b. Spam emails commonly featured URLs and the words 'company', 'information', and 'font'.
The average email length was around 201 tokens, but the longest email contained 28,624 tokens.
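Statistics like those above can be reproduced with a few lines of standard-library Python. The sketch below recomputes the class balance from the counts given in the Data section and shows one way to get token frequencies and lengths; the helper name token_stats, the whitespace tokenization, and the sample texts are assumptions for illustration.

```python
from collections import Counter

def token_stats(texts, top_n=3):
    """Return mean length, max length, and the most common tokens for preprocessed texts."""
    token_lists = [text.split() for text in texts]
    lengths = [len(tokens) for tokens in token_lists]
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {
        "mean_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
        "most_common": counts.most_common(top_n),
    }

# Class balance from the reported counts: 20,695 ham and 19,068 spam out of 39,763.
total_emails = 20695 + 19068
ham_share = 20695 / total_emails
spam_share = 19068 / total_emails

# Made-up preprocessed texts standing in for the real corpus.
demo_stats = token_stats(
    ["URL number company", "enron subject email number number"],
    top_n=1,
)
```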

The Jupyter Notebook for this phase can be found here: DataWrangling&Exploration.ipynb


anastasiaarsky/ML-Capstone-Spam-Classification
