30lm32/ml-spam-sms-classification

Which one catches all* SPAM SMS?

| Problem | Data | Methods | Libs | Link |
| --- | --- | --- | --- | --- |
| NLP | Text | Naive Bayesian, SVM, Random Forest Classifier, Deep Learning - LSTM, Word2Vec | Sklearn, Keras, Gensim, Pandas, Seaborn | https://github.com/erdiolmezogullari/ml-spam-sms-classification |

To see further ML projects, visit my main repo: https://github.com/erdiolmezogullari/ml-projects

In this project, we applied supervised learning (classification) algorithms and deep learning (LSTM).

We used a public SMS Spam dataset, which is not entirely clean. The data consists of two columns (features): context and class. The context column holds the SMS text. The class column takes the value spam or ham and labels the corresponding SMS.

Before applying any supervised learning method, we ran a series of data cleansing operations, since some of the SMS text is broken or messy.
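The exact cleansing operations are not listed here, so the following is only a minimal sketch of what such a step could look like. The function names `clean_sms` and `clean_rows`, the choice of normalizations, and the `(label, text)` row layout are all assumptions made for illustration.

```python
import html
import re
import unicodedata

VALID_LABELS = {"ham", "spam"}


def clean_sms(text: str) -> str:
    """Normalize one raw SMS string (hypothetical cleaning steps)."""
    text = html.unescape(text)                  # e.g. "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    # Replace control/format characters (tabs, newlines, ...) with spaces.
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip().lower()


def clean_rows(rows):
    """Keep only rows with a known label and non-empty cleaned text."""
    cleaned = []
    for label, text in rows:
        label = label.strip().lower()
        text = clean_sms(text)
        if label in VALID_LABELS and text:
            cleaned.append((label, text))
    return cleaned


raw_rows = [
    ("ham", "Ok lar..."),
    ("SPAM ", "  FREE   entry &amp; WIN\t£1000!!\n"),
    ("??", "broken label"),
    ("ham", "   "),
]
rows = clean_rows(raw_rows)
```

Rows with an unknown label or with text that is empty after cleaning are dropped, so `rows` above keeps only the first two entries.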

After obtaining the cleaned dataset, we created tokens and lemmas of the SMS corpus separately using spaCy, and then generated bag-of-words and TF-IDF representations of the corpus. In addition to these transformations, we also applied SVD, SVC, and PCA to reduce the dimensionality of the dataset.
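As a rough illustration of the bag-of-words, TF-IDF, and SVD steps, here is a small sketch built on scikit-learn. The toy corpus, `n_components=2`, and the use of `CountVectorizer`'s default tokenizer (instead of spaCy tokens or lemmas) are assumptions made to keep the example short.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus standing in for the cleaned SMS texts (illustrative only).
corpus = [
    "free entry win cash prize now",
    "are you coming home for dinner",
    "win a free phone call now",
    "see you at home tonight",
]

# Bag-of-words: one row per SMS, one column per vocabulary term.
bow = CountVectorizer()
counts = bow.fit_transform(corpus)        # sparse matrix (n_docs, n_terms)

# TF-IDF: reweight raw counts so that very common terms weigh less.
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)     # same shape as counts

# Truncated SVD works directly on sparse input and reduces the
# number of columns to n_components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(weights)      # dense array (n_docs, 2)
```

`TruncatedSVD` is used here because it accepts sparse matrices. scikit-learn's `PCA` centers the data and would typically need a dense array first.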

To manage data transformations effectively in the training and testing phases and to avoid data leakage, we used Sklearn's Pipeline class. We added each data transformation step (e.g. bag-of-words, TF-IDF, SVC) and a classifier (e.g. Naive Bayesian, SVM, Random Forest Classifier) to an instance of the Pipeline class.
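A pipeline of this kind might look like the sketch below. The step names, the toy data, the split parameters, and the choice of `MultinomialNB` as the classifier are assumptions for illustration. The point is that the vectorizer is fitted on the training split only, which is how the pipeline avoids leakage from the test split.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the cleaned SMS corpus (illustrative only).
texts = [
    "win a free prize now",
    "free cash claim your prize",
    "urgent call now to win cash",
    "claim your free phone today",
    "see you at home tonight",
    "are you coming for dinner",
    "call me when you get home",
    "thanks for the lift today",
]
labels = ["spam"] * 4 + ["ham"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Each transformation and the final classifier are pipeline steps.
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# fit() learns the vocabulary and IDF weights from the training split only;
# predict() reuses those fitted transformations on the unseen test split.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```

Swapping in another classifier, such as an SVM or a random forest, only requires replacing the `"clf"` step, which keeps the comparison between classifiers on equal footing.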

After applying these supervised learning methods, we also tried deep learning, using an LSTM-based architecture. To use an LSTM in Keras (TensorFlow), we needed an embedding matrix for our corpus, so we used Gensim's Word2Vec to obtain the embedding matrix rather than TF-IDF.
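To make the embedding-matrix step concrete, the sketch below shows one way to turn word vectors into a matrix indexed by integer word ids, plus padded id sequences for the LSTM input. It uses only NumPy: the plain dictionary `word_vectors` is a stand-in for vectors trained with Gensim's Word2Vec, and the helper names, the toy values, and the post-padding scheme are assumptions.

```python
import numpy as np


def build_embedding_matrix(word_index, word_vectors, dim):
    """Row i holds the vector of the word whose integer id is i.

    Row 0 is reserved for padding, and words without a vector stay all-zero.
    """
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        if word in word_vectors:
            matrix[idx] = word_vectors[word]
    return matrix


def to_padded_ids(tokens, word_index, max_len):
    """Map tokens to ids, drop unknown tokens, truncate, then pad with 0."""
    ids = [word_index[t] for t in tokens if t in word_index][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical vocabulary and 3-dimensional vectors (illustrative values).
word_index = {"free": 1, "prize": 2, "dinner": 3}
word_vectors = {
    "free": [0.1, 0.2, 0.3],
    "prize": [0.4, 0.5, 0.6],
    # "dinner" has no vector here, so its row stays zero.
}

matrix = build_embedding_matrix(word_index, word_vectors, dim=3)
padded = to_padded_ids(["free", "unknown", "dinner"], word_index, max_len=5)
```

A matrix like this could then initialize a Keras `Embedding` layer placed in front of the LSTM, for example through `embeddings_initializer=keras.initializers.Constant(matrix)`, with `input_dim` and `output_dim` taken from `matrix.shape`.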

After running each classifier, we plotted its confusion matrix to compare the classifiers and see which one is best at filtering spam SMS.
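The comparison can be sketched as follows. The labels and the predictions of the two placeholder models (`model_a`, `model_b`) are made up for illustration and are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for two hypothetical classifiers.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
predictions_by_model = {
    "model_a": ["ham", "ham", "spam", "ham", "ham", "spam"],
    "model_b": ["ham", "spam", "spam", "spam", "ham", "spam"],
}

labels = ["ham", "spam"]  # fixes the row/column order of the matrix
matrices = {}
for name, y_pred in predictions_by_model.items():
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    matrices[name] = cm
    # With spam as the positive class: tn, fp, fn, tp.
    tn, fp, fn, tp = cm.ravel()
    print(f"{name}: missed spam={fn}, ham flagged as spam={fp}")
```

In this toy case `model_a` misses one spam message, while `model_b` catches all spam but flags one ham message. Each matrix can be drawn with Seaborn, for example `seaborn.heatmap(cm, annot=True)`.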