waleedaliSe/Spam-Detection-Using-Weka

Spam Detection Using Weka

Problem Statement

We want to determine whether an email is spam or not using text classification in WEKA.

Introduction

Every day, email users receive hundreds of spam messages with new content, sent from new addresses that are generated automatically by bots. Filtering spam with traditional methods such as blacklists and whitelists (of domains, IP addresses, or mailing addresses) is almost impossible. Applying text mining methods to email can make spam filtering more effective. Classifying spam messages may also make it possible to relate the topics of spam to its geographic origin.

DATASET

The dataset contains 5180 emails in three folders: norm for normal, ham for harm, and spam for spam. The dataset features are as follows.
1. How frequently a particular word or character occurs in the email.
2. The run-length attributes (55-57), which measure the length of sequences of consecutive capital letters.
Source: Spam Email Datasets
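
The two kinds of features listed above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the choice of average, longest, and total run length for the three run-length attributes is an assumption based on how such attributes are commonly defined, not something stated in this README.

```python
import re
from collections import Counter


def word_frequency_percent(text, word):
    """Percentage of words in `text` that equal `word` (case-insensitive)."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * Counter(words)[word.lower()] / len(words)


def capital_run_lengths(text):
    """Lengths of all uninterrupted runs of capital letters in `text`."""
    return [len(run) for run in re.findall(r"[A-Z]+", text)]


def capital_run_features(text):
    """Average, longest, and total capital run length (assumed attribute set)."""
    runs = capital_run_lengths(text)
    if not runs:
        return {"average": 0.0, "longest": 0, "total": 0}
    return {
        "average": sum(runs) / len(runs),
        "longest": max(runs),
        "total": sum(runs),
    }


sample = "WIN a FREE prize now, free entry"
print(word_frequency_percent(sample, "free"))
print(capital_run_features(sample))
```

Spam often contains words such as "free" and long runs of capitals, which is why features like these can help a classifier separate the classes.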

Implementation

C4.5(J48) Algorithm

We use the C4.5 algorithm, which WEKA implements as J48. C4.5, developed by Ross Quinlan, generates a decision tree and is an extension of Quinlan's earlier ID3 algorithm. Because the decision trees it generates can be used for classification, C4.5 is often referred to as a statistical classifier. It became quite popular after ranking #1 in the Top 10 Algorithms in Data Mining.
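
C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the split information of the attribute. The sketch below shows that criterion for a single nominal feature. It is a simplified illustration with hypothetical function names and toy data, and it leaves out the rest of C4.5 (continuous attributes, missing values, pruning).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of labels or feature values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a nominal feature divided by its split information."""
    total = len(labels)
    # Group the class labels by the feature value of each example.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after splitting on the feature.
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: does the email contain the word "free"? (hypothetical feature)
labels = ["spam", "spam", "norm", "norm"]
print(gain_ratio([True, True, False, False], labels))   # feature separates the classes
print(gain_ratio([True, False, True, False], labels))   # feature carries no information
```

A feature that separates the classes perfectly gets the highest ratio, so the tree would test it first.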

A decision tree is a graph that uses a branching method to illustrate every possible outcome of a decision. Programmatically, decision trees can be used to assign monetary, time, or other values to possible outcomes so that decisions can be automated.

The following is an example decision tree.
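
In code, a tree of this kind can be written as nested branches that are followed from the root to a leaf. The tree below is a toy sketch: the feature names contains_free and many_capitals are hypothetical, and it is not the tree that J48 learned from this dataset.

```python
# Hypothetical tree: inner nodes test a boolean feature, leaves hold a class label.
toy_tree = {
    "feature": "contains_free",
    "yes": {"feature": "many_capitals", "yes": "spam", "no": "norm"},
    "no": "norm",
}


def classify(tree, example):
    """Walk from the root to a leaf, following the branch for each feature test."""
    node = tree
    while isinstance(node, dict):
        branch = "yes" if example[node["feature"]] else "no"
        node = node[branch]
    return node


print(classify(toy_tree, {"contains_free": True, "many_capitals": True}))
```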

Naive Bayes Algorithm

We also compare these results with those obtained from Naive Bayes, a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
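
The independence assumption means the score of a class is simply the class prior multiplied by one probability per word. The sketch below shows this with a small multinomial variant and add-one smoothing. The variant, the function names, and the toy training data are assumptions chosen for illustration; WEKA's own Naive Bayes classifier may differ in detail.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (list_of_words, label). Returns the counts used for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the label with the highest log-probability for the given words."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        # log P(class) plus one independent log P(word | class) term per word.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one smoothing keeps unseen words from producing a zero probability.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("free prize now".split(), "spam"),
    ("win free money".split(), "spam"),
    ("meeting at noon".split(), "norm"),
    ("project meeting notes".split(), "norm"),
]
model = train_naive_bayes(training)
print(predict(model, ["free", "money"]))
print(predict(model, ["meeting", "notes"]))
```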

How to Run?

Source: Downloading and installing Weka

1. Normalize Data

The dataset came as 5180 text files, as shown in the following screenshots.

(3 Folders in one Folder)

(Inside one folder)

(One File)

As the screenshots above show, the data is not normalized or well organized, so we cannot give it to WEKA as input directly. WEKA needs all the data in a single ARFF file for training. For this purpose, we run the following command in WEKA's command line interface:

java weka.core.converters.TextDirectoryLoader -dir F:/Spam_mails > F:/text_example.arff

The output ARFF file is as follows.
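
To make the structure of such a file concrete, the sketch below builds ARFF text from a directory with one subfolder per class, which is roughly what the loader does. It is an approximation under stated assumptions: the function names are hypothetical, and the attribute names text and @@class@@ reflect what TextDirectoryLoader is believed to emit rather than anything confirmed in this README.

```python
import os


def quote_arff(text):
    """Escape a string so it fits on one line inside single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    escaped = escaped.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
    return "'" + escaped + "'"


def directory_to_arff(root_dir, relation="emails"):
    """Build ARFF text from root_dir/<class_name>/<files>, one instance per file."""
    classes = sorted(
        name for name in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, name))
    )
    lines = [
        "@relation " + relation,
        "",
        "@attribute text string",
        # Assumed class attribute name; the folder names become the nominal values.
        "@attribute @@class@@ {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for label in classes:
        folder = os.path.join(root_dir, label)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8", errors="replace") as handle:
                lines.append(quote_arff(handle.read()) + "," + label)
    return "\n".join(lines) + "\n"
```

Each email becomes one row under @data: the full message as a quoted string, followed by the name of the folder it came from.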

The data is now normalized, so we give it to WEKA for further preprocessing as follows.

C4.5(J48) Algorithm Implementation

Next, we select and apply our classifier as follows.

We use every frequent word as a feature, so we break each string into a word vector as follows.
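
Turning a string into a word vector means building a vocabulary from all emails and then representing each email by one value per vocabulary word. The sketch below uses 1 for "word present" and 0 for "word absent". The function names and the simple tokenizer are assumptions for illustration; WEKA's filter offers more options, such as word counts and a limit on how many words to keep.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts):
    """Sorted list of every distinct word across all texts."""
    return sorted({word for text in texts for word in tokenize(text)})


def to_word_vector(text, vocabulary):
    """One entry per vocabulary word: 1 if it occurs in the text, otherwise 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


emails = ["Free prize now", "Meeting at noon"]
vocabulary = build_vocabulary(emails)
print(vocabulary)
print(to_word_vector("free FREE meeting", vocabulary))
```

Every email then has the same number of attributes, which is what classifiers such as J48 and Naive Bayes expect as input.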

First, we train on the data by selecting the "Use training set" test option, and we get the following results.

Here we get 98% accuracy. Because this figure is measured on the same data the model was trained on, it is likely optimistic. We then train with different split percentages and get the following results.

On 66% split percentage we get 93% accuracy.

On 80% split percentage we get 94% accuracy.

On 90% split percentage we get 89% accuracy.
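
A percentage split trains on the given share of the data and measures accuracy on the rest. The sketch below shows the idea. The function names, the fixed seed, and the rounding rule are assumptions for illustration and may not match WEKA's exact behavior.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle the instances, then split them into training and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Percentage of predictions that match the actual labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


train, test = percentage_split(range(100), 66)
print(len(train), len(test))
print(accuracy(["spam", "norm", "spam", "norm"], ["spam", "norm", "norm", "norm"]))
```

With a 90% split, only about 518 of the 5180 emails are left for testing, so the accuracy estimate rests on fewer examples, which may contribute to the lower figure at that setting.
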

We then decided to test our model, so we built a test dataset from our own email accounts, as shown in the following screenshot.

We give this test dataset to our trained model and get the following predictions.

Our model gives the predictions shown in the screenshot above.

Naive Bayes Algorithm Implementation

We repeat the same procedure with Naive Bayes, as shown in the following screenshots.

Naive Bayes gives different results, also with good accuracy.

Conclusion

Email spamming is a common technique, but it can do heavy damage to a user's privacy. Many anti-spam tools are currently available to fight spam mail, and text classification is one of the best ways to detect it. We could improve its accuracy with a much larger dataset and by restricting our algorithms to ignore ordinary dictionary words and classify on frequently used spam words.
