Let $\mathbf{x} = (x_1, x_2, \dots, x_n)$ be the vector of all words in the email.

If we want to find whether the email is a ham ($y = 0$) or a spam ($y = 1$), we need to find the conditional probability:

$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\, P(y)}{P(\mathbf{x})}$$

Applying the "naïve" assumption that the occurrence of each word in the email is independent of the others, i.e. the sequence of words in the sentence does not matter, we have:

$$P(\mathbf{x} \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

where we expanded the conditional probability of $\mathbf{x}$ into each of its components $x_i$, and in short the predicted class is:

$$\hat{y} = \arg\max_{y \in \{0, 1\}} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Now we need to calculate $P(x_i \mid y)$ for $y = 0$ and $y = 1$; these are estimated from the training data.

There will often be some words in the email that are not in the bag-of-words built from the training data. Originally,

$$P(x_i \mid y) = \frac{\operatorname{count}(x_i, y)}{\sum_{j} \operatorname{count}(x_j, y)}$$

In this situation the numerator becomes 0 and the whole product vanishes. To solve this, we apply Laplace smoothing and define:

$$P(x_i \mid y) = \frac{\operatorname{count}(x_i, y) + 1}{\sum_{j} \operatorname{count}(x_j, y) + N}$$

where $N$ is the total number of features (vocabulary size). In particular, any unknown word will have a probability of $\frac{1}{\sum_{j} \operatorname{count}(x_j, y) + N}$.
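The derivation above can be sketched as a minimal multinomial Naive Bayes classifier with Laplace smoothing. The toy emails and word lists below are hypothetical placeholders, not the original training data:

```python
from collections import Counter
from math import log

# Hypothetical toy training data (not from the original dataset).
ham_emails  = [["meeting", "at", "noon"], ["project", "report", "due"]]
spam_emails = [["win", "free", "money"], ["free", "prize", "money"]]

# Count word occurrences per class.
ham_counts, spam_counts = Counter(), Counter()
for email in ham_emails:
    ham_counts.update(email)
for email in spam_emails:
    spam_counts.update(email)

vocab = set(ham_counts) | set(spam_counts)
N = len(vocab)  # total number of features (vocabulary size)

def log_likelihood(email, counts, total):
    # Laplace smoothing: (count + 1) / (total + N), so an unseen word
    # gets probability 1 / (total + N) instead of 0.
    return sum(log((counts[w] + 1) / (total + N)) for w in email)

def predict(email):
    # Class priors P(y) estimated from the training data.
    n_total = len(ham_emails) + len(spam_emails)
    score_ham = log(len(ham_emails) / n_total) + \
        log_likelihood(email, ham_counts, sum(ham_counts.values()))
    score_spam = log(len(spam_emails) / n_total) + \
        log_likelihood(email, spam_counts, sum(spam_counts.values()))
    # argmax over y, computed in log space to avoid underflow.
    return "spam" if score_spam > score_ham else "ham"

print(predict(["free", "money", "now"]))   # "now" is unseen but smoothed
print(predict(["project", "meeting"]))
```

Working in log space replaces the product $\prod_i P(x_i \mid y)$ with a sum of logs, which avoids floating-point underflow on long emails without changing the argmax.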
|             | Predicted ham | Predicted spam |
| ----------- | ------------- | -------------- |
| Actual ham  | 1990          | 22             |
| Actual spam | 2             | 79             |
- Precision (spam as the positive class): 0.782
- Recall (spam as the positive class): 0.975
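These metrics follow directly from the confusion matrix above, treating spam as the positive class; a quick check in Python:

```python
# Confusion matrix entries, with spam as the positive class.
tp, fp = 79, 22   # predicted spam: actually spam / actually ham
fn, tn = 2, 1990  # predicted ham:  actually spam / actually ham

precision = tp / (tp + fp)  # 79 / 101 ≈ 0.782
recall    = tp / (tp + fn)  # 79 / 81  ≈ 0.975

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

High recall here means very few spam emails slip through to the inbox, while the lower precision reflects the 22 ham emails misclassified as spam.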
- comment with `[dev]` for development updates;
- comment with `[debug]` for debug fixes;
- comment with `[doc]` for documentation.
- RESTful API
- Flask backend
- Flask hosted on Firebase
- Chrome extension
- Chatbot (Hard)
- WeChat bot
- Discord bot
- Emotion Analysis (Easy) - Bilibili, NetEase, etc.
- Maybe we can do a comparison between classic algorithms and neural networks
- Voice2Text/Video2Text
- Generative fake news (Hard)
- Autocomplete/Autocorrect/Spell Check (Hardest)
- Search engine optimization
- Duplicate Detection
- Algorithmic Trading
- Streamlining patient information