Web App Link :- https://gaurav-van-toxic-comment-web-app-app-24y37c.streamlitapp.com/
Project Repo: https://github.com/Gaurav-Van/Data_Science__Machine_Learning-Projects
Classifying comments into six toxicity categories, including the neutral (non-toxic) case of each, using concepts of NLP and ML:
- Toxic
- Severe Toxic
- Threat
- Obscene
- Insult
- Identity Hate
Instead of multiclass classification, a separate binary classification (category present vs. absent) is performed for each category.
1. Data Collection - From Kaggle: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
2. Data Pre-Processing - Text Pre-Processing Using Regular Expressions
- Removing \n characters
- Removing Alpha-Numeric Characters
- Removing Punctuation
- Removing Non-ASCII Characters
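The four cleaning steps above can be sketched with the standard `re` module. This is a minimal illustration, not the project's actual code: the function name `clean_comment` is hypothetical, and "alpha-numeric" is assumed here to mean tokens that mix letters and digits (such as `abc123`).

```python
import re
import string

# Character class matching any ASCII punctuation mark.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")

def clean_comment(text: str) -> str:
    """Apply the four regex cleaning steps to one comment (illustrative sketch)."""
    text = re.sub(r"\n", " ", text)            # 1. replace newline characters with spaces
    text = re.sub(r"\w*\d\w*", " ", text)      # 2. drop tokens containing digits (assumed meaning)
    text = PUNCTUATION.sub(" ", text)          # 3. replace punctuation with spaces
    text = re.sub(r"[^\x00-\x7F]+", "", text)  # 4. strip non-ASCII characters
    return re.sub(r"\s+", " ", text).strip()   # collapse the leftover whitespace

print(clean_comment("Hello,\nworld!! abc123 café ok"))
```

Replacing punctuation with a space rather than deleting it keeps words such as `end.Start` from being glued together.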
3. EDA - Performing Data Analysis to Discover Issues and Trends in the Data
- Bar charts of each category revealed a problem: class imbalance. Solution: build a separate dataset for each category [ id, comment_text, category ] in which the frequency of 0s equals the frequency of 1s.
- This resolves the class imbalance and also sets up the binary classification of each category.
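One way to build such a balanced per-category dataset is random undersampling with pandas. The sketch below is an assumption about how the balancing could be done, not the project's confirmed method; the helper name `balanced_subset` and the fixed seed are hypothetical, while the column names follow the Kaggle CSV.

```python
import pandas as pd

def balanced_subset(df: pd.DataFrame, category: str, seed: int = 42) -> pd.DataFrame:
    """Return [id, comment_text, category] with equal counts of 0s and 1s."""
    positives = df[df[category] == 1]
    negatives = df[df[category] == 0]
    n = min(len(positives), len(negatives))  # size of the smaller class
    subset = pd.concat([
        positives.sample(n=n, random_state=seed),
        negatives.sample(n=n, random_state=seed),  # undersample the larger class
    ])
    # Keep only the three columns of interest and shuffle the rows.
    subset = subset[["id", "comment_text", category]]
    return subset.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Calling `balanced_subset(train_df, "toxic")` once per category would yield six datasets, each ready for its own binary classifier (`train_df` stands for the loaded Kaggle training data).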
4. Model Building
- Vectorization :- Using TF-IDF and Unigram Approach
- Model Used For Each Category :- KNN, Logistic Regression, SVM, CNB, BNB, DT and RF
- Model Selected - Logistic Regression
- Exporting Trained ML Models as 6 pickle files [ one for each category ]
- Exporting Fitted TF-IDF Vectorizers as 6 pickle files [ one for each category ]
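For a single category, the vectorize, train and export steps could look roughly like the following scikit-learn sketch. The function names, the `max_iter` setting and the `<category>_vectorizer.pkl` / `<category>_model.pkl` file names are assumptions made for illustration; the repository may use different names and hyperparameters.

```python
import pickle
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_category(texts, labels):
    """Fit a unigram TF-IDF vectorizer and a logistic regression on one category."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigrams only
    features = vectorizer.fit_transform(texts)
    model = LogisticRegression(max_iter=1000)         # max_iter is an assumed setting
    model.fit(features, labels)
    return vectorizer, model

def export_category(name, vectorizer, model, out_dir="."):
    """Pickle the fitted vectorizer and model; file names are hypothetical."""
    out = Path(out_dir)
    vectorizer_path = out / f"{name}_vectorizer.pkl"
    model_path = out / f"{name}_model.pkl"
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return vectorizer_path, model_path
```

Repeating both calls for each of the six balanced datasets produces the twelve pickle files described above. The vectorizer is saved alongside the model because new comments must be transformed with the same vocabulary the model was trained on.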
5. Deployment - Building a web app with Streamlit and deploying it on Streamlit Cloud
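A Streamlit front end for the six exported model and vectorizer pairs might be structured as below. This is only a sketch under stated assumptions: the `models` directory, the pickle file names and the helper names `load_artifacts`, `predict_all` and `main` are hypothetical and may not match the deployed app.

```python
import pickle
from pathlib import Path

# Category names as they appear in the Kaggle CSV columns.
CATEGORIES = ["toxic", "severe_toxic", "threat", "obscene", "insult", "identity_hate"]

def load_artifacts(model_dir):
    """Load one (vectorizer, model) pair per category; file names are assumed."""
    artifacts = {}
    for name in CATEGORIES:
        with open(Path(model_dir) / f"{name}_vectorizer.pkl", "rb") as f:
            vectorizer = pickle.load(f)
        with open(Path(model_dir) / f"{name}_model.pkl", "rb") as f:
            model = pickle.load(f)
        artifacts[name] = (vectorizer, model)
    return artifacts

def predict_all(comment, artifacts):
    """Return the positive-class probability for every category."""
    scores = {}
    for name, (vectorizer, model) in artifacts.items():
        features = vectorizer.transform([comment])
        scores[name] = float(model.predict_proba(features)[0][1])
    return scores

def main():
    # Imported here so the helpers above can be reused without Streamlit installed.
    import streamlit as st

    st.title("Toxic Comment Classifier")
    artifacts = load_artifacts("models")  # hypothetical directory name
    comment = st.text_area("Enter a comment")
    if st.button("Classify") and comment.strip():
        for name, score in predict_all(comment, artifacts).items():
            st.write(f"{name}: {score:.2f}")
```

To try it, save the code as `app.py`, add a final line calling `main()`, and start it with `streamlit run app.py`. Any cleaning applied during training should also be applied to the comment before prediction.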