Let $\mathbf{x} = (x_1, x_2, \dots, x_n)$ be the vector of all words in the email.

If we want to find whether the email is a ham ($y = 0$) or a spam ($y = 1$), we need to find the conditional probability:

$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\, P(y)}{P(\mathbf{x})}$$

Applying the "naïve" assumption that the occurrence of each word in the email is independent of the others, i.e. the sequence of words in the sentence does not matter, we have:

$$P(\mathbf{x} \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

where we expanded the conditional probability of $\mathbf{x}$ into each of its components $x_i$, and in short the predicted class is:

$$\hat{y} = \arg\max_{y \in \{0, 1\}} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Now we need to calculate $P(x_i \mid y)$ for $y = 0$ and $y = 1$; these are estimated from the training data.

There will often be some words in the email that are not in the bag-of-words built from the training data. Originally,

$$P(x_i \mid y) = \frac{\operatorname{count}(x_i, y)}{\sum_{j} \operatorname{count}(x_j, y)}$$

In this situation the numerator becomes 0 and the whole product vanishes. To solve this, we apply Laplace smoothing and define:

$$P(x_i \mid y) = \frac{\operatorname{count}(x_i, y) + 1}{\sum_{j} \operatorname{count}(x_j, y) + N}$$

where $N$ is the total number of features (vocabulary size). In particular, any unknown word will have a probability of $\frac{1}{\sum_{j} \operatorname{count}(x_j, y) + N}$.
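The derivation above can be sketched as a minimal multinomial Naive Bayes classifier with Laplace smoothing. The toy emails and word lists below are hypothetical placeholders, not the original training data:

```python
from collections import Counter
from math import log

# Hypothetical toy training data (not from the original dataset).
ham_emails  = [["meeting", "at", "noon"], ["project", "report", "due"]]
spam_emails = [["win", "free", "money"], ["free", "prize", "money"]]

# Count word occurrences per class.
ham_counts, spam_counts = Counter(), Counter()
for email in ham_emails:
    ham_counts.update(email)
for email in spam_emails:
    spam_counts.update(email)

vocab = set(ham_counts) | set(spam_counts)
N = len(vocab)  # total number of features (vocabulary size)

def log_likelihood(email, counts, total):
    # Laplace smoothing: (count + 1) / (total + N), so an unseen word
    # gets probability 1 / (total + N) instead of 0.
    return sum(log((counts[w] + 1) / (total + N)) for w in email)

def predict(email):
    # Class priors P(y) estimated from the training data.
    n_total = len(ham_emails) + len(spam_emails)
    score_ham = log(len(ham_emails) / n_total) + \
        log_likelihood(email, ham_counts, sum(ham_counts.values()))
    score_spam = log(len(spam_emails) / n_total) + \
        log_likelihood(email, spam_counts, sum(spam_counts.values()))
    # argmax over y, computed in log space to avoid underflow.
    return "spam" if score_spam > score_ham else "ham"

print(predict(["free", "money", "now"]))   # "now" is unseen but smoothed
print(predict(["project", "meeting"]))
```

Working in log space replaces the product $\prod_i P(x_i \mid y)$ with a sum of logs, which avoids floating-point underflow on long emails without changing the argmax.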
|             | Predicted ham | Predicted spam |
| ----------- | ------------- | -------------- |
| Actual ham  | 1990          | 22             |
| Actual spam | 2             | 79             |
- Precision (spam as the positive class): 0.782
- Recall (spam as the positive class): 0.975
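These metrics follow directly from the confusion matrix above, treating spam as the positive class; a quick check in Python:

```python
# Confusion matrix entries, with spam as the positive class.
tp, fp = 79, 22   # predicted spam: actually spam / actually ham
fn, tn = 2, 1990  # predicted ham:  actually spam / actually ham

precision = tp / (tp + fp)  # 79 / 101 ≈ 0.782
recall    = tp / (tp + fn)  # 79 / 81  ≈ 0.975

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

High recall here means very few spam emails slip through to the inbox, while the lower precision reflects the 22 ham emails misclassified as spam.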
- comment with `[dev]` for development updates;
- comment with `[debug]` for debug fixes;
- comment with `[doc]` for documentation.
- RESTful API
- Flask backend
- Flask hosted on Firebase
- Chrome extension
- Chatbot (Hard)
- WeChat bot
- Discord bot
- Emotion Analysis (Easy) - Bilibili, NetEase, etc.
- Maybe we can do a comparison between classic algorithms and neural networks
- Voice2Text/Video2Text
- Generative fake news (Hard)
- Autocomplete/Autocorrect/Spell Check (Hardest)
- Search engine optimization
- Duplicate Detection
- Algorithmic Trading
- Streamlining patient information