Data Mining: Naive Bayes and Decision Tree Classifiers

  • Naive Bayes and Decision Tree classifiers implemented with scikit-learn, with Graphviz visualization
  • Datasets:
    • News (a subset of the 20 Newsgroups dataset, with test labels)
    • Mushroom (with test labels)
    • Income (UCI Adult Income dataset, without test labels)

Environment

  • < scikit-learn 0.20.1 >
  • < numpy 1.15.4 >
  • < pandas 0.23.4 >
  • < Python 3.7 >
  • < tqdm 4.28.1 > (optional - progress bar)
  • < graphviz 0.10.1 > (optional - visualization)
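  • The pinned versions above can be installed with pip, e.g.: pip install scikit-learn==0.20.1 numpy==1.15.4 pandas==0.23.4 tqdm==4.28.1 graphviz==0.10.1 (note: the graphviz Python package also needs the system Graphviz binaries installed to render images)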

File Description

.
├── src/
|   ├── classifiers.py ----------> Implementation of the Naive Bayes and decision tree classifiers
|   ├── data_loader.py ----------> Data loader that handles the reading and preprocessing of all 3 datasets
|   └── runner.py ---------------> Runner that runs all modes: train + evaluate, search for the optimal model, visualize the model, etc.
├── data/ -----------------------> unzipped from data.zip
|   ├── income
|   |   ├── income_test.csv
|   |   ├── income_train.csv
|   |   ├── income.names
|   |   └── sample_output.csv
|   ├── mushroom
|   |   ├── mushroom_test.csv
|   |   ├── mushroom_train.csv
|   |   ├── mushroom.names
|   |   └── sample_output.csv
|   └── news
|       ├── news_test.csv
|       ├── news_train.csv
|       └── sample_output.csv
├── image/ ----------------------> visualization and program output screenshots
├── result/ ---------------------> model prediction output
├── problem_description.pdf -----> Work spec
└── Readme.md -------------------> This file

Usage

Data

  • Unzip data.zip with: unzip data.zip

Naive Bayes Classifier

  • Train and test the Naive Bayes classifier with the best alpha parameter under its best distribution assumption:

    • News dataset: python3 runner.py --naive_bayes --data_news
    • Mushroom dataset: python3 runner.py --naive_bayes --data_mushroom
    • Income dataset: python3 runner.py --naive_bayes --data_income
  • Search for the best alpha parameter for each distribution assumption of the Naive Bayes classifier (see the sketch at the end of this section):

    • Add the --search_opt argument
    • News dataset (validated on the testing set):
    python3 runner.py --naive_bayes --search_opt --data_news
    
    • Mushroom dataset (validated on the testing set):
    python3 runner.py --naive_bayes --search_opt --data_mushroom
    
    • Income dataset (using N-fold cross-validation on the training set):
    python3 runner.py --naive_bayes --search_opt --data_income
    
  • Compare all distribution assumptions of the Naive Bayes classifier, each with its own best alpha parameter:

    • Add the --run_all argument
    • News dataset: python3 runner.py --naive_bayes --run_all --data_news
    • Mushroom dataset: python3 runner.py --naive_bayes --run_all --data_mushroom
    • Income dataset: python3 runner.py --naive_bayes --run_all --data_income
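For reference, a minimal sketch of what the --search_opt mode conceptually does, assuming a simple grid search over alpha with cross-validation (the stand-in data, the alpha grid, and the cv value are assumptions, not the repo's exact settings):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    # Stand-in data; in the repo, data_loader.py supplies the real training split.
    X_train = np.random.randint(0, 5, size=(200, 30))  # count-like features
    y_train = np.random.randint(0, 2, size=200)        # binary labels

    best_alpha, best_acc = 0.0, 0.0
    for alpha in np.linspace(1e-4, 1.0, 50):           # assumed candidate grid
        acc = cross_val_score(MultinomialNB(alpha=alpha), X_train, y_train, cv=10).mean()
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    print('best alpha: %.4f, accuracy: %.5f' % (best_alpha, best_acc))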

Decision Tree Classifier

  • Train and test the Decision Tree classifier with the best max depth parameter:

    • News dataset: python3 runner.py --decision_tree --data_news
    • Mushroom dataset: python3 runner.py --decision_tree --data_mushroom
    • Income dataset: python3 runner.py --decision_tree --data_income
  • Search for the best max depth parameter for the Decision Tree classifier:

    • Add the --search_opt argument
    • News dataset (validated on the testing set):
    python3 runner.py --decision_tree --search_opt --data_news
    
    • Mushroom dataset (validated on the testing set):
    python3 runner.py --decision_tree --search_opt --data_mushroom
    
    • Income dataset (using N-fold cross-validation on the training set):
    python3 runner.py --decision_tree --search_opt --data_income
    
  • Visualize the Decision Tree classifier with the best max depth parameter (a minimal export sketch follows this list):

    • Add the --visualize_tree argument
    • News dataset: python3 runner.py --decision_tree --visualize_tree --data_news
    • Mushroom dataset: python3 runner.py --decision_tree --visualize_tree --data_mushroom
    • Income dataset: python3 runner.py --decision_tree --visualize_tree --data_income
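A minimal sketch of the Graphviz export step behind --visualize_tree (the data, feature names, and class names are stand-ins; the repo's actual plumbing lives in runner.py):

    import numpy as np
    import graphviz
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    X = np.random.rand(100, 4)             # stand-in features
    y = np.random.randint(0, 2, size=100)  # stand-in labels
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

    dot_data = export_graphviz(clf, out_file=None, filled=True, rounded=True,
                               feature_names=['f0', 'f1', 'f2', 'f3'],
                               class_names=['neg', 'pos'])
    graphviz.Source(dot_data, format='png').render('tree')  # writes tree.png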

Result - Naive Bayes Performance

News Dataset - Test Set Accuracy

  • naive_bayes.GaussianNB() => 0.80979 (baseline)
  • naive_bayes.MultinomialNB(alpha=0.065) => 0.89511
  • naive_bayes.ComplementNB(alpha=0.136) => 0.88811
  • naive_bayes.BernoulliNB(alpha=0.002) => 0.82727

Mushroom Dataset - Test Set Accuracy

  • naive_bayes.GaussianNB() => 0.95505 (baseline)
  • naive_bayes.MultinomialNB(alpha=0.0001) => 0.99569
  • naive_bayes.ComplementNB(alpha=0.0001) => 0.99507
  • naive_bayes.BernoulliNB(alpha=0.0001) => 0.98830

Income Dataset - N-Fold Cross-Validation Accuracy

  • naive_bayes.GaussianNB() => 0.58602 (baseline)
  • naive_bayes.MultinomialNB(alpha=0.959) => 0.79148
  • naive_bayes.ComplementNB(alpha=0.16) => 0.74992
  • naive_bayes.BernoulliNB(alpha=0.001) => 0.75760
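The rows above correspond to evaluations like the following sketch, shown here with stand-in data and the News-dataset alphas reported above (the real splits come from data_loader.py):

    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB

    X_train = np.random.randint(0, 5, size=(200, 30))  # stand-in count features
    y_train = np.random.randint(0, 2, size=200)        # stand-in labels
    X_test, y_test = X_train, y_train                  # stand-in test split

    for model in [GaussianNB(), MultinomialNB(alpha=0.065),
                  ComplementNB(alpha=0.136), BernoulliNB(alpha=0.002)]:
        pred = model.fit(X_train, y_train).predict(X_test)
        print(type(model).__name__, accuracy_score(y_test, pred))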

Result - Decision Tree Performance

News Dataset - Test Set Accuracy

  • tree.DecisionTreeClassifier(criterion='gini', splitter='random', random_state=1337, max_depth=64) => 0.64895
  • Decision tree visualization with the Graphviz toolkit: see the screenshots in the image/ folder

Mushroom Dataset - Test Set Accuracy

  • tree.DecisionTreeClassifier(criterion='gini', splitter='random', random_state=1337, max_depth=64) => 1.0
  • Decision tree visualization with the Graphviz toolkit: see the screenshots in the image/ folder

Income Dataset - N-Fold Cross-Validation Accuracy

  • tree.DecisionTreeClassifier(criterion='entropy', max_depth=15, min_impurity_decrease=2e-4) => 0.83554
  • Decision tree visualization with the Graphviz toolkit: see the screenshots in the image/ folder (a cross-validation sketch for this configuration follows)
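A sketch of evaluating the income-dataset configuration above with cross-validation (stand-in data; the repo's N for N-fold is not stated, so cv=10 is an assumption):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X = np.random.rand(300, 105)           # stand-in for the preprocessed income features
    y = np.random.randint(0, 2, size=300)  # stand-in binary labels
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=15,
                                 min_impurity_decrease=2e-4)
    print(cross_val_score(clf, X, y, cv=10).mean())  # cv value is an assumption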

Data Preprocessing

News Dataset Preprocessing

  • None; the raw input is used as-is

Mushroom Dataset Preprocessing

  • The 22 categorical attributes are transformed into a 117-dimensional one-hot feature vector
  • Resulting data shape: see the screenshot in the image/ folder (a minimal encoding sketch follows)
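A minimal sketch of this one-hot expansion with pandas (the CSV path follows the file tree above; treating the first column as the label is an assumption):

    import pandas as pd

    df = pd.read_csv('data/mushroom/mushroom_train.csv')
    X = pd.get_dummies(df.iloc[:, 1:])  # assumes column 0 is the label
    print(X.shape)                      # 22 categorical attributes -> 117 one-hot dims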

Income Dataset Preprocessing

  • Cast each entry to one of two data types: int or str
  • Identify all missing entries ('?') and replace them with np.nan
  • Impute all missing entries:
    • If the dtype is int: impute with the mean value of the feature column
    • If the dtype is str: impute with the most frequent value in the feature column
  • Split the data into categorical and continuous features and process them separately:
    • categorical feature indices = [1, 3, 5, 6, 7, 8, 9, 13]
    • continuous feature indices = [0, 2, 4, 10, 11, 12]
  • For categorical data:
    • The 8 categorical attributes are transformed into a 99-dimensional one-hot feature vector
  • For continuous data:
    • Normalize each feature column by its maximum norm
  • Re-concatenate the categorical and continuous features; for the resulting data shape, see the screenshot in the image/ folder (a condensed sketch of the pipeline follows)
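A condensed sketch of this pipeline, assuming the train CSV has no header row and the column layout given above:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.read_csv('data/income/income_train.csv', header=None).replace('?', np.nan)
    cat_idx = [1, 3, 5, 6, 7, 8, 9, 13]  # categorical feature columns
    num_idx = [0, 2, 4, 10, 11, 12]      # continuous feature columns

    # Impute missing entries: column mean for continuous, most frequent for categorical.
    num = SimpleImputer(strategy='mean').fit_transform(df[num_idx])
    cat = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_idx])

    num = num / np.abs(num).max(axis=0)             # max-norm each continuous column
    cat = pd.get_dummies(pd.DataFrame(cat)).values  # 8 attributes -> 99 one-hot dims
    X = np.hstack([cat, num])                       # re-concatenate
    print(X.shape)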