Identifying and distinguishing spam SMS and Email using the multinomial Naïve Bayes model.

mohammadnabia/Multinomial-nb-Spam-Identifier

Multinomial Naive Bayes Spam Message Identifier

Identifying and distinguishing spam messages using the multinomial Naïve Bayes model.

What is a Naive Bayes classifier?

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. At the time of writing this repository, five types of Naive Bayes classifier are commonly listed (these are the ones implemented in scikit-learn):

  • Bernoulli Naive Bayes classifier
  • Categorical Naive Bayes classifier
  • Complement Naive Bayes classifier
  • Gaussian Naive Bayes classifier
  • Multinomial Naive Bayes classifier

In this repository, we use the multinomial Naive Bayes classifier to detect spam messages. We chose it because it is simple to implement, achieves high accuracy, and works directly on vectorized word counts. Other approaches can also be used to detect spam messages, such as the Complement Naive Bayes classifier, or tf-idf features in place of raw counts.

Let's learn more about the multinomial Naive Bayes classifier

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
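To make the "word vector counts" representation concrete, here is a minimal sketch using only the Python standard library. The helper names `build_vocabulary` and `count_vector` and the two toy messages are hypothetical, chosen for illustration; they are not taken from the notebook or the dataset, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter


def build_vocabulary(messages):
    """Return a sorted list of the distinct lowercase tokens in `messages`."""
    return sorted({token for msg in messages for token in msg.lower().split()})


def count_vector(message, vocabulary):
    """Map a message to a list of word counts, one entry per vocabulary word."""
    counts = Counter(message.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical toy messages (assumption: whitespace tokenization is enough here).
messages = ["free prize free entry", "see you at lunch"]
vocabulary = build_vocabulary(messages)
vectors = [count_vector(m, vocabulary) for m in messages]

for message, vector in zip(messages, vectors):
    print(message, "->", vector)
```

Each message becomes a vector of length n (the vocabulary size), and entry i is the count x_i that the multinomial model works with.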

The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in samples of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
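As a small worked example of the smoothed estimate above, the sketch below computes the estimated feature probabilities for one class from its count vectors. The function name `smoothed_theta` and the toy count vectors are hypothetical, and `alpha=1.0` is an assumed smoothing value, not necessarily the one used in the notebook.

```python
def smoothed_theta(class_vectors, alpha=1.0):
    """Estimate the smoothed feature probabilities for a single class.

    class_vectors: count vectors of the training samples belonging to the class.
    alpha: smoothing parameter from the formula (assumed value for this sketch).
    """
    n = len(class_vectors[0])  # number of features (vocabulary size)
    # N_yi: total count of feature i over all samples of the class
    n_yi = [sum(vec[i] for vec in class_vectors) for i in range(n)]
    # N_y: total count of all features for the class
    n_y = sum(n_yi)
    return [(count + alpha) / (n_y + alpha * n) for count in n_yi]


# Hypothetical count vectors for two "spam" training messages, three features.
spam_vectors = [[2, 1, 0], [1, 0, 0]]
theta_spam = smoothed_theta(spam_vectors, alpha=1.0)
print(theta_spam)
```

Here the per-feature totals are 3, 1, and 0, so with alpha = 1 and n = 3 the estimates are 4/7, 2/7, and 1/7. The third feature never occurs in the class, yet smoothing keeps its probability above zero, and the estimates still sum to 1.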

Dataset used

I used the SMS Spam Collection dataset to train my model. It can be accessed via the link below: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Reviewing the results of our trained model

Our multinomial Naïve Bayes model achieves the following scores:

  • Accuracy: 99.01345291479821 %
  • Precision: 97.88732394366197 %
  • Recall: 94.5578231292517 %

We can use the confusion matrix to observe the performance of our model:

[Image: confusion matrix of the trained model]
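Accuracy, precision, and recall can all be derived from the four cells of the confusion matrix. The following is a minimal sketch of that relationship in plain Python; the function names and the toy label lists are hypothetical and do not reproduce the numbers reported above.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels for six messages (illustration only).
y_true = ["spam", "ham", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall = metrics(*counts)
print(counts, accuracy, precision, recall)
```

Precision measures how many messages flagged as spam really are spam, while recall measures how many spam messages were caught, so a model can score well on one and poorly on the other.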

Steps

  • Import libraries
  • Upload dataset
  • Create the data frame
  • Split the data
  • Vectorize the data
  • Train & predict
  • Calculate accuracy, precision, and recall
  • Calculate the confusion matrix
  • Test the model with a new SMS/email message
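The steps above can be sketched end to end with scikit-learn. This is an illustrative outline under stated assumptions, not the notebook's actual code: the toy corpus is hypothetical and stands in for the SMS Spam Collection file, the data frame step is skipped in favor of plain lists, and the split ratio, `random_state`, and `alpha` are assumed values.

```python
# Import libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the real dataset.
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward today", "urgent prize waiting call now",
    "are we still on for lunch", "see you at the meeting",
    "can you call me later", "thanks for the help today",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Split the data (assumed 75/25 split, stratified so both classes are tested).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Vectorize the data: fit the vocabulary on the training messages only.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train & predict
model = MultinomialNB(alpha=1.0)
model.fit(X_train_counts, y_train)
y_pred = model.predict(X_test_counts)

# Calculate accuracy, precision, and recall (spam is the positive class).
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label="spam", zero_division=0)
recall = recall_score(y_test, y_pred, pos_label="spam", zero_division=0)

# Calculate the confusion matrix (rows: true labels, columns: predictions).
matrix = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])

# Test the model with a new message.
new_message = ["claim your free prize now"]
new_prediction = model.predict(vectorizer.transform(new_message))[0]
print(accuracy, precision, recall, matrix, new_prediction, sep="\n")
```

Fitting the vectorizer on the training split only, then reusing it for the test split and for new messages, keeps the vocabulary fixed and avoids leaking test data into training.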

More information is available in the Jupyter Notebook file.