Marvel0usx/Spam-Filter

Spam Filter Using Naïve Bayes Classifier

Build a Multinomial Naïve Bayes classifier

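Let $\mathbf{w} = (w_1, w_2, \dots, w_n)$ be the vector of all words in the email.

To decide whether the email is ham ($H$) or spam ($S$), we need the conditional probability of each class $c \in \{H, S\}$ given the words, which by Bayes' rule is:

$$P(c \mid \mathbf{w}) = \frac{P(\mathbf{w} \mid c)\, P(c)}{P(\mathbf{w})}$$

Applying the "naïve" assumption that the words in the email occur independently of each other given the class, i.e. the order of words in the sentence does not matter, we have:

$$P(\mathbf{w} \mid c) = P(w_1 \mid c)\, P(w_2 \mid c) \cdots P(w_n \mid c),$$

where we expanded the conditional probability of $\mathbf{w}$ into one factor per component $w_i$, and in short:

$$P(\mathbf{w} \mid c) = \prod_{i=1}^{n} P(w_i \mid c)$$

Now, we need to calculate $P(w_i \mid c)$ for $c = H$ and $c = S$; these are estimated from the training data.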
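As a rough illustration of how these probabilities can be estimated by counting, here is a minimal Python sketch. The function names (`train`, `prior`, `likelihood`, `score`) and the toy emails are assumptions made up for this example, not the repository's actual code. It uses the plain, unsmoothed estimates, so a word never seen in a class drives that class's score to zero.

```python
from collections import Counter

def train(emails):
    """Count emails per class and word occurrences per class.

    `emails` is a list of (list_of_words, label) pairs (hypothetical format).
    """
    class_counts = Counter()
    word_counts = {"ham": Counter(), "spam": Counter()}
    for words, label in emails:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts

def prior(class_counts, label):
    # Fraction of training emails that belong to `label`.
    return class_counts[label] / sum(class_counts.values())

def likelihood(word_counts, word, label):
    # Unsmoothed estimate: occurrences of `word` in the class
    # divided by the total number of words in that class.
    total = sum(word_counts[label].values())
    return word_counts[label][word] / total

def score(class_counts, word_counts, words, label):
    # Prior times the product of per-word likelihoods (naive assumption).
    p = prior(class_counts, label)
    for w in words:
        p *= likelihood(word_counts, w, label)
    return p

# Toy training data, invented for illustration only.
toy = [
    (["win", "cash", "now"], "spam"),
    (["cash", "prize"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
]
class_counts, word_counts = train(toy)

email = ["cash", "now"]
scores = {c: score(class_counts, word_counts, email, c) for c in ("ham", "spam")}
prediction = max(scores, key=scores.get)
```

The normalising denominator is the same for both classes, so comparing the two scores is enough to pick a label.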

Laplace Correction

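Some words in the bag-of-words will often never appear in the training emails of a given class. Originally,

$$P(w_i \mid c) = \frac{\operatorname{count}(w_i, c)}{\sum_{w \in V} \operatorname{count}(w, c)},$$

where $\operatorname{count}(w, c)$ is the number of times word $w$ occurs in training emails of class $c$ and $V$ is the vocabulary.

In this situation, the numerator becomes 0 and the whole product vanishes. To solve this, we define:

$$P(w_i \mid c) = \frac{\operatorname{count}(w_i, c) + 1}{\sum_{w \in V} \operatorname{count}(w, c) + |V|},$$

where $|V|$ is the total number of features (the vocabulary size).

In particular, any unknown word will have a probability of $\frac{1}{\sum_{w \in V} \operatorname{count}(w, c) + |V|}$.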
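The add-one correction can be sketched in a few lines of Python. The helper names (`smoothed_likelihood`, `log_score`) and the toy counts are hypothetical, chosen only for illustration. Summing logarithms instead of multiplying probabilities is an extra step not described above; it is a common way to avoid numerical underflow on long emails.

```python
import math
from collections import Counter

def smoothed_likelihood(counts, vocab_size, word):
    # Add-one (Laplace) estimate: (count + 1) / (total words in class + vocabulary size).
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

def log_score(prior, counts, vocab_size, words):
    # Log of: prior times the product of smoothed per-word likelihoods.
    return math.log(prior) + sum(
        math.log(smoothed_likelihood(counts, vocab_size, w)) for w in words
    )

# Toy per-class word counts, invented for illustration only.
spam_counts = Counter({"win": 1, "cash": 2, "now": 1, "prize": 1})
ham_counts = Counter({"meeting": 1, "at": 2, "noon": 2, "lunch": 1})
vocab = set(spam_counts) | set(ham_counts)  # 8 distinct words

p_seen = smoothed_likelihood(spam_counts, len(vocab), "cash")      # (2 + 1) / (5 + 8)
p_unseen = smoothed_likelihood(spam_counts, len(vocab), "noon")    # (0 + 1) / (5 + 8)
p_unknown = smoothed_likelihood(spam_counts, len(vocab), "zebra")  # same as any zero-count word

email = ["cash", "noon"]
spam_score = log_score(0.5, spam_counts, len(vocab), email)
ham_score = log_score(0.5, ham_counts, len(vocab), email)
```

With smoothing, neither class score collapses to zero even though "noon" never occurs in the spam counts and "cash" never occurs in the ham counts.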

Resources

Article on Analyticsvidhya

Stanford lecture slides

UofT lecture slides

Evaluation

Confusion Matrix

| | Predicted ham | Predicted spam |
|---|---|---|
| Actual ham | 1990 | 22 |
| Actual spam | 2 | 79 |
  • Precision: 0.782 (79 / (79 + 22), with spam as the positive class)
  • Recall: 0.975 (79 / (79 + 2))
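The two metrics follow directly from the confusion matrix when spam is treated as the positive class. A short Python check, with variable names chosen here for illustration:

```python
# Confusion-matrix cells, with spam as the positive class.
true_negative, false_positive = 1990, 22   # actual ham row
false_negative, true_positive = 2, 79      # actual spam row

# Precision: of the emails predicted as spam, the fraction that really are spam.
precision = true_positive / (true_positive + false_positive)

# Recall: of the actual spam emails, the fraction that were caught.
recall = true_positive / (true_positive + false_negative)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```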

Contribute to this repo

  • comment with [dev] for development updates;
  • comment with [debug] for bug fixes;
  • comment with [doc] for documentation.

Software

  • RESTful API
  • Flask backend
  • Flask hosted on Firebase:
  • Chrome extension

Other topics related to NLP/Naïve Bayes

  • Chatbot (Hard)
    • WeChat bot
    • Discord bot
  • Emotion Analysis (Easy) - Bilibili, Netease, etc.
    • Maybe we can compare classic algorithms with neural networks
    • Voice2Text/Video2Text
  • Generative fake news (Hard)
  • Autocomplete/Autocorrect/Spell Check (Hardest)
  • Search engine optimization
  • Duplicate Detection
  • Algorithmic Trading
  • Streamlining patient information

About

This is an attempt to build a naive Bayes classifier from scratch.
