This project classifies text messages as spam or ham (non-spam) using machine learning models (Naive Bayes, Random Forest, Decision Tree, K-Nearest Neighbors) and natural language processing techniques.
- Installation
- Dataset
- Data Preprocessing
- Word Cloud for Spam and Ham Messages
- Data Transformation
- Results
- Contributing
- License
1. Clone the repository:

   `git clone https://github.com/Elilora/spam-text-classification.git`
   `cd spam-text-classification`

2. Install the required libraries:

   `pip install numpy pandas seaborn matplotlib scikit-learn xgboost shap`
The dataset was obtained from Kaggle (https://www.kaggle.com/team-ai/spam-text-message-classification) and contains 5,157 unique messages.
Data preprocessing steps include calculating the length of messages, exploring basic statistics, and visualizing the distribution of message categories.
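As an illustration, the length column and per-category statistics can be computed with pandas. The three-row frame below is a hypothetical stand-in for the Kaggle CSV, and the column names `Category` and `Message` are assumptions:

```python
import pandas as pd

# Hypothetical sample standing in for the Kaggle CSV
df = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Are we still on for lunch?",
                "WINNER!! Claim your free prize now",
                "ok see you then"],
})

# Add a message-length column and inspect basic statistics per category
df["Length"] = df["Message"].str.len()
print(df.groupby("Category")["Length"].describe())
```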
Word clouds are generated to visualize the most common words in spam and ham messages.
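A word cloud is driven by word frequencies. A minimal sketch of that counting step, using a hypothetical two-message spam subset (the real project would feed the full spam and ham corpora separately):

```python
from collections import Counter

# Hypothetical spam messages standing in for the dataset's spam subset
spam_messages = [
    "free prize claim your free cash now",
    "winner claim your free prize",
]

# Word frequencies across all spam messages; a word cloud renders these,
# sizing each word by its count
spam_freqs = Counter(" ".join(spam_messages).split())
print(spam_freqs.most_common(3))
```

The resulting counter can then be handed to the `wordcloud` package via `WordCloud().generate_from_frequencies(spam_freqs)` and displayed with matplotlib.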
Text data is preprocessed by removing special characters, converting text to lowercase, tokenizing, stemming, removing stopwords, and expanding contractions.
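A dependency-free sketch of those steps is shown below. The tiny stopword and contraction tables are placeholders for the full lists the project presumably uses, and stemming (e.g., via NLTK's `PorterStemmer`) is omitted for brevity:

```python
import re

# Placeholder stopword list; the project presumably uses a full list
STOPWORDS = {"a", "an", "the", "is", "to", "have", "you", "your", "now"}

# A few common contractions; the project expands a fuller set
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "you've": "you have"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                      # lowercase
    for short, full in CONTRACTIONS.items(): # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)    # drop special characters/digits
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords

print(preprocess("WINNER!! You've won, claim your FREE prize now!"))
```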
The following models were used to classify messages as spam or ham:
- Naive Bayes Classifier
- K-Nearest Neighbors Classifier
- Decision Tree Classifier
- Random Forest Classifier
- Split the data into training and testing sets (70% train, 30% test).
- Train the models on the training set.
- Evaluate the models on the testing set using accuracy and F1-score metrics.
- Generate classification reports and confusion matrices.
Various machine learning models are trained and evaluated using the processed data.
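The steps above can be sketched end-to-end with scikit-learn. The twelve inline messages are fabricated stand-ins for the Kaggle data, and the bag-of-words `CountVectorizer` is an assumption about the feature representation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Fabricated mini-corpus standing in for the Kaggle dataset
messages = [
    "win a free prize now", "free entry claim your cash prize",
    "urgent winner claim free cash", "congratulations you won a free trip",
    "free cash prize urgent claim now", "win cash now free entry",
    "are we still meeting for lunch", "see you at the gym later",
    "can you pick up milk on the way", "meeting moved to three today",
    "call me when you get home", "thanks for the ride yesterday",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Bag-of-words features (assumed representation)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# 70% train / 30% test split, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, stratify=labels, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label="spam")
print(f"Accuracy: {acc:.2f}  F1 (spam): {f1:.2f}")
print(classification_report(y_test, y_pred, zero_division=0))
print(confusion_matrix(y_test, y_pred))
```

Swapping `MultinomialNB()` for `KNeighborsClassifier(n_neighbors=2)`, `DecisionTreeClassifier()`, or `RandomForestClassifier()` reproduces the other three experiments with the same pipeline.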
The Multinomial Naive Bayes classifier is trained and evaluated for text classification.
- Accuracy: 97%
- F1-score: 99%
The K-Nearest Neighbors classifier with `n_neighbors=2` is trained and evaluated for text classification.
- Accuracy: 93%
- F1-score: 96%
The Decision Tree classifier is trained and evaluated for text classification.
- Accuracy: 97%
- F1-score: 98%
The Random Forest classifier is trained and evaluated for text classification.
- Accuracy: 98%
- F1-score: 99%
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.