spam-classifier

Classify spambase dataset: https://archive.ics.uci.edu/ml/datasets/Spambase

Statistics are based on a 70/30 train/test split, averaged over 50 runs.
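
The original evaluation script is not reproduced here, but a minimal sketch of this protocol (assuming `X` and `y` have been loaded from `spambase.data`; the `evaluate` helper name is mine, not code from this repository) could look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumption: X holds the 57 Spambase features and y the 0/1 spam labels,
# e.g. loaded with np.loadtxt("spambase.data", delimiter=",").

def evaluate(clf, X, y, runs=50, test_size=0.3):
    """Average accuracy and AUC over repeated random 70/30 splits."""
    accs, aucs = [], []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        clf.fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, clf.predict(X_te)))
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return np.mean(accs), np.std(accs), np.mean(aucs), np.std(aucs)
```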

sklearn

Naive Bayes

| Method | Accuracy Avg | Accuracy Std | AUC Avg | AUC Std | Top 5 Features (y=ham) | Top 5 Features (y=spam) |
| --- | --- | --- | --- | --- | --- | --- |
| Gaussian | 0.80800 | 0.00781 | 0.84982 | 0.00714 | [('650', 1.2760192697768751), ('credit', 1.2476267748478689), ('hpl', 0.88242393509127726), ('people', 0.52748478701825663), ('font', 0.4292748478701825)] | [('credit', 2.2555208333333345), ('font', 1.3963862179487192), ('people', 0.54309294871794844), ('business', 0.53835737179487231), ('over', 0.50399038461538403)] |
| Multinomial | 0.87230 | 0.00714 | 0.95302 | 0.00390 | [('650', -1.9278529931605961), ('credit', -1.9801256215148824), ('hpl', -2.3107274101918582), ('people', -2.868116591218965), ('edu', -3.0283909589368152)] | [('credit', -1.4507395482450338), ('font', -1.9440271739962247), ('people', -2.8479883095744416), ('business', -2.90044359646099), ('over', -2.9511096702487905)] |
| Bernoulli (alpha=1.0, bin=0.31) | 0.89164 | 0.00528 | 0.95017 | 0.00403 | [('credit', -0.62493334953781776), ('people', -1.0074197209180786), ('hpl', -1.0558622403769), ('font', -1.2449217749113739), ('george', -1.3607226493637654)] | [('credit', -0.12910183231238115), ('font', -0.24123449696324872), ('people', -0.55563963955578277), ('over', -0.72801086549656979), ('3d', -0.73629591703067643)] |

Decision Trees

| Method | Accuracy Avg | Accuracy Std | AUC Avg | AUC Std | Top 5 Features |
| --- | --- | --- | --- | --- | --- |
| DecisionTreeClassifier(criterion="entropy") | 0.91136 | 0.00758 | 0.91214 | 0.00853 | [('remove', 0.21730246077008433), ('free', 0.1383136780699713), ('hp', 0.078559911850530045), ('money', 0.066369886207552659), ('george', 0.054680657133664989)] |

Random Forest

| Method | Accuracy Avg | Accuracy Std | AUC Avg | AUC Std | Top 5 Features |
| --- | --- | --- | --- | --- | --- |
| Random Forest | 0.93936 | 0.00656 | 0.97864 | 0.00362 | [('free', 0.098474220547133465), ('remove', 0.092541805565311538), ('your', 0.09022347711132328), ('you', 0.061481529787477354), ('000', 0.061045877750548337)] |

My Implementation

Naive Bayes

| Method | Accuracy Avg | Accuracy Std | AUC Avg | AUC Std | Top 5 Features (y=ham) | Top 5 Features (y=spam) |
| --- | --- | --- | --- | --- | --- | --- |
| Gaussian | 0.80858 | 0.01097 | 0.85702 | 0.00859 | [('650', 1.3194578005115096), ('credit', 1.2579437340153434), ('hpl', 0.94403580562659861), ('people', 0.53731969309462912), ('george', 0.42955498721227592)] | [('credit', 2.2747826086956491), ('font', 1.3878418972332005), ('business', 0.54054545454545522), ('people', 0.5380316205533594), ('over', 0.51667193675889345)] |
| Multinomial | 0.86884 | 0.00866 | 0.95108 | 0.00528 | [('credit', 0.14393208823250517), ('650', 0.14351229786150893), ('hpl', 0.09480940574430724), ('people', 0.056671145539713377), ('font', 0.048257592687994753)] | [('credit', 0.23500355935145184), ('font', 0.14540852047528446), ('people', 0.056419703295146083), ('over', 0.052507351381355288), ('business', 0.051720856290241896)] |
| Bernoulli (alpha=1.0, bin=0.31) | 0.87806 | 0.00736 | 0.95719 | 0.00401 | [('credit', 0.10577409242592786), ('people', 0.072504803316816663), ('hpl', 0.069268884619273968), ('font', 0.057033067044190547), ('george', 0.050156739811912252)] | [('credit', 0.1190450352685837), ('font', 0.1069994574064025), ('people', 0.077590884427563678), ('3d', 0.066087900162778032), ('over', 0.065653825284861606)] |

Analysis

Naive Bayes

Three Naive Bayes classifiers were tested: Gaussian, multinomial, and multi-variate Bernoulli. Gaussian NB assumes that the values of each feature are continuous and normally distributed. In multinomial NB, a document d is modeled as the outcome of |d| independent trials over the vocabulary; typically, a document is represented as a vector of word counts or word frequencies. Multi-variate Bernoulli NB represents a document as a binary vector over the vocabulary, so each document can be seen as a collection of independent Bernoulli experiments, one for each word in the vocabulary [1].
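
As a rough sketch of how these three event models map onto scikit-learn for this dataset (the alpha and binarize values mirror the Bernoulli row in the results tables above; the `evaluate` helper is the one sketched earlier, not code from this repository):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Spambase features are word/character frequencies, so they can be fed directly
# to GaussianNB and MultinomialNB; BernoulliNB binarizes them internally.
models = {
    "Gaussian":    GaussianNB(),                           # continuous features, per-class normal
    "Multinomial": MultinomialNB(alpha=1.0),               # frequencies treated as multinomial trials
    "Bernoulli":   BernoulliNB(alpha=1.0, binarize=0.31),  # presence/absence of each word
}

for name, clf in models.items():
    acc_avg, acc_std, auc_avg, auc_std = evaluate(clf, X, y)
    print(f"{name}: acc {acc_avg:.5f} +/- {acc_std:.5f}, auc {auc_avg:.5f} +/- {auc_std:.5f}")
```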

Table 1. Accuracy and AUC for Naive Bayes Methods

| Method | Accuracy | AUC |
| --- | --- | --- |
| Gaussian | 80.800% +/- 0.216% | 84.982% +/- 0.198% |
| Multinomial | 87.230% +/- 0.198% | 95.302% +/- 0.108% |
| Bernoulli | 89.164% +/- 0.146% | 95.017% +/- 0.112% |

Table 1 summarizes the metrics for each method. Based on accuracy, Bernoulli appears to be the better classifier; based on AUC, however, multinomial beats it. This should be taken with a grain of salt, as the authors of [2] do not believe that "standard auc is a good measure for spam filters, because it is dominated by non-high specificity (ham recall) regions, which are of no interest in practice."
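
One way to focus the metric on the region [2] cares about (my own illustration, not an experiment from this repository or from [2]) is scikit-learn's partial AUC, which truncates the ROC curve at a low false-positive rate, i.e. keeps only the high-specificity / high-ham-recall region:

```python
from sklearn.metrics import roc_auc_score

# scores: spam probabilities for the held-out set, e.g. clf.predict_proba(X_te)[:, 1]
full_auc = roc_auc_score(y_te, scores)                     # whole ROC curve
high_spec_auc = roc_auc_score(y_te, scores, max_fpr=0.1)   # only FPR <= 0.1 (few hams misclassified)
```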

Table 2. Multinomial model training with frequency vs binary word occurrence vectors

| Method | Accuracy | AUC |
| --- | --- | --- |
| Frequency | 87.230% +/- 0.198% | 95.302% +/- 0.108% |
| Binary | 87.954% +/- 0.230% | 95.756% +/- 0.136% |

Inspired by previous research [3], a multinomial model was also trained using binary word-occurrence vectors instead of frequency vectors. The results in Table 2 show a slight increase in both accuracy and AUC when using binary word-occurrence vectors rather than the usual word-frequency vectors, which is consistent with the findings of [3].
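
A minimal sketch of this variant, again reusing the hypothetical `evaluate` helper from above (the binarization threshold used for Table 2 is not stated; 0.0, i.e. "word present at all", is shown as one plausible choice):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

# Replace each word-frequency feature with a 0/1 occurrence indicator
# before fitting the multinomial model.
binary_mnb = make_pipeline(Binarizer(threshold=0.0), MultinomialNB(alpha=1.0))
acc_avg, acc_std, auc_avg, auc_std = evaluate(binary_mnb, X, y)
```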

Although [2] demonstrates that the binary multinomial model should yield better results than the Bernoulli model, this did not occur with the given data, likely because the vocabulary is not large enough, as the accuracy results in [3] suggest. I suspect that increasing the vocabulary size would let the multinomial model surpass the Bernoulli model.

References

[1] A. McCallum and K. Nigam, "A comparison of event models for naive bayes text classification", AAAI-98 workshop on learning for text categorization, vol. 752, pp. 41-48, 1998.

[2] V. Metsis, I. Androutsopoulos and G. Paliouras, "Spam Filtering with Naive Bayes – Which Naive Bayes?", in Conference on Email and Anti-Spam, Mountain View, California USA, 2006.

[3] K. Schneider, "On word frequency information and negative evidence in Naive Bayes text classification", EsTAL, vol. 3230, pp. 474-486, 2004.
