Classifies text messages as spam or ham using text analysis with NLTK and machine learning algorithms.

Prince2124/Spam_or_Ham_Msg_Classifier

Spam or Ham Message Classification

The purpose of this project is to understand how machine learning algorithms can be used to build an SMS spam detection model. Specifically, we build a binary classification model that detects whether a text message is spam or not. Spam messages often ask us to fill in forms or to share personal information or account number details, which is suspicious and likely to be fraud. The goal of this project is to use data science to accurately classify whether a message is spam or not.

Data

The dataset used to build the model comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection. It has 2 columns: the text message and its label. It contains 5572 text messages, each labelled ham or spam. The dataset is imbalanced, with 4825 instances of the ham class and 747 instances of the spam class.
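To make the imbalance concrete, the short calculation below turns the counts quoted above into class shares (roughly 13.4% spam and 86.6% ham). The variable names are illustrative only and are not taken from the project notebook.

```python
# Class counts as reported for the SMS Spam Collection dataset.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
ham_share = ham_count / total

# A model that always predicts "ham" would already be right for every ham
# message, so its accuracy equals the ham share of the dataset.
majority_baseline_accuracy = ham_share

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Because of this baseline, accuracy figures for this dataset are best read against about 86.6% rather than against 50%.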

Data cleaning and preprocessing

The data consists of text messages that include personal conversations as well as spam. We pre-processed the data with the following steps:

A. Replaced phone numbers, email addresses, http links, money symbols, and other numbers in the messages with whitespace, using regular expressions.

B. Converted the messages to lowercase.

C. Applied stemming to reduce words to their root form.

D. Removed stop words from the messages, since they do not help differentiate spam from ham. The remaining tokens are the important words.

E. Applied CountVectorizer to prepare the data for modeling.
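Steps A to D could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the regular expressions, the tiny stop-word set, the punctuation removal, and the `preprocess` function name are all assumptions made for the example. The project presumably uses NLTK's full English stop-word list, and the sketch falls back to no stemming if NLTK is not installed.

```python
import re

try:
    # PorterStemmer works without downloading any NLTK corpus.
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    # Fallback so the sketch still runs without NLTK: words are left unstemmed.
    def stem(word):
        return word

# Tiny illustrative stop-word set (an assumption, not the project's list).
STOP_WORDS = {"a", "an", "and", "are", "at", "for", "i", "in", "is", "me",
              "my", "of", "on", "or", "the", "to", "you", "your"}

# Hypothetical patterns for the items listed in step A.
PATTERNS = [
    re.compile(r"\S+@\S+\.\S+"),                         # email addresses
    re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE),  # http links
    re.compile(r"[£$€]"),                                # money symbols
    re.compile(r"\d+"),                                  # phone and other numbers
]


def preprocess(message: str) -> str:
    """Return a cleaned, lowercased, stemmed version of one message."""
    text = message
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)          # step A
    text = text.lower()                        # step B
    # Assumption: also drop punctuation so tokens are plain words.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Steps C and D: skip stop words, stem everything else.
    tokens = [stem(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


sample = "WIN £1000 now! Call 09061701461 or visit http://example.com, email win@example.com"
cleaned = preprocess(sample)
print(cleaned)
```

Filtering stop words before stemming keeps the lookup on the original word forms, so a stemmed token never has to match the stop-word list.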

Execution of the process

The project contains a Jupyter notebook, spam_classification.ipynb, which implements this project.

Screenshots of the confusion matrices of the different classifier models

A. Decision Tree Classifier

[Image: Using_DecisionTreeClassifier]

B. Naive Bayes Classifier

[Image: Using_NaiveBayesClassifier]

C. SVM Classifier

[Image: Using_SVM_Classifier]
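For readers unfamiliar with confusion matrices, the sketch below shows how one is built for the two classes ham and spam. The labels are made up for illustration and are not the project's predictions, and `confusion_matrix` here is a small hand-written helper rather than a library function.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels=("ham", "spam")):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical labels, invented for this example only.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

matrix = confusion_matrix(y_true, y_pred)

# Accuracy is the diagonal (correct predictions) divided by all predictions.
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / len(y_true)

print(matrix)
print(f"accuracy: {accuracy:.3f}")
```

The off-diagonal cells separate the two kinds of error: ham flagged as spam (top right) and spam let through as ham (bottom left), which an accuracy figure alone does not distinguish.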

Results

The results using CountVectorizer features:

A. Naive Bayes classifier: accuracy 98.7%

B. Decision Tree classifier (entropy): accuracy 98.1%

C. SVM with linear kernel: accuracy 98.7%
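A comparison of this kind could be set up with scikit-learn roughly as shown below. This is a sketch under stated assumptions, not the notebook's code: the toy messages are invented, `MultinomialNB` is assumed as the Naive Bayes variant, and the split and parameters are illustrative. It will not reproduce the accuracies listed above, which come from the full dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Tiny invented corpus, for illustration only.
train_texts = [
    "win free prize call now",
    "free cash claim your prize",
    "urgent claim free reward now",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after work",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["claim your free prize now", "lunch at home today"]
test_labels = ["spam", "ham"]

# Bag-of-words counts; the vocabulary is learned from the training texts only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Naive Bayes": MultinomialNB(),  # assumed variant
    "Decision Tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    scores[name] = accuracy_score(test_labels, predictions)
    print(f"{name}: accuracy {scores[name]:.2f}")
```

Fitting the vectorizer on the training texts and only transforming the test texts keeps test vocabulary from leaking into the model.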

Reference

https://github.com/krishnaik06/SpamClassifier
