Building multiple models to classify the numerical Gamma/Hadron dataset.
The two classes were balanced by undersampling the majority class.
The dataset was split into
- 70% train
- 30% test
Since the feature ranges differ widely, MinMaxScaler was applied to the data. A sketch of the full preprocessing pipeline is shown below.
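The following is a minimal sketch of the preprocessing described above, assuming the data has been loaded into a pandas DataFrame `df` with the class label in a `class` column holding `"g"`/`"h"` values; the DataFrame name, column name, and label values are assumptions, not confirmed by this README:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Split the frame by class; "class", "g", and "h" are assumed names.
gamma = df[df["class"] == "g"]
hadron = df[df["class"] == "h"]

# Undersample the majority class so both classes have equal counts.
n = min(len(gamma), len(hadron))
balanced = pd.concat([gamma.sample(n, random_state=42),
                      hadron.sample(n, random_state=42)])

X = balanced.drop(columns=["class"])
y = (balanced["class"] == "g").astype(int)

# 70% train / 30% test split, stratified to preserve the balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Fit the scaler on the training split only, then transform both splits.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```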
We trained the following machine learning algorithms:
- Decision Tree Classifier
No hyperparameter tuning was done.
The metrics for the Decision Tree are as follows (a training and evaluation sketch appears after the list):
- Accuracy = 79.09%
- Precision = 79.19%
- Recall = 78.91%
- F1 score = 79.05%
- Specificity = 79.26%
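A minimal sketch of how these numbers could be produced, reusing the scaled splits from the preprocessing sketch above. The `report` helper is a name introduced here for illustration; it derives specificity from the confusion matrix, since scikit-learn has no built-in specificity scorer:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

def report(y_true, y_pred):
    """Print the five metrics used throughout this README, as percentages."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"Accuracy    = {100 * accuracy_score(y_true, y_pred):.2f}")
    print(f"Precision   = {100 * precision_score(y_true, y_pred):.2f}")
    print(f"Recall      = {100 * recall_score(y_true, y_pred):.2f}")
    print(f"F1 score    = {100 * f1_score(y_true, y_pred):.2f}")
    print(f"Specificity = {100 * tn / (tn + fp):.2f}")

# Default hyperparameters, since no tuning was done for this model.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
report(y_test, tree.predict(X_test))
```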
- Naive Bayes
No hyperparameter tuning was done.
The metrics for Naive Bayes are as follows:
- Accuracy = 63.71%
- Precision = 59.20%
- Recall = 88.19%
- F1 score = 70.84%
- Specificity = 39.23%
The poor performance of Naive Bayes stems from its naive assumption that the features are independent, whereas the dataset description indicates the features are correlated, with several of them derived from one another. A fitting sketch follows.
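A minimal sketch, assuming Gaussian Naive Bayes (the README does not name the variant) and reusing the `report` helper defined in the Decision Tree sketch:

```python
from sklearn.naive_bayes import GaussianNB

# GaussianNB is an assumption; the README only says "Naive Bayes".
nb = GaussianNB()
nb.fit(X_train, y_train)
report(y_test, nb.predict(X_test))
```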
- K-Nearest Neighbors (KNN)
K-Fold Cross Validation was used to tune the value of K.
The metrics for KNN are as follows (a tuning sketch appears after the list):
- Accuracy = 81.58%
- Precision = 77.90%
- Recall = 88.19%
- F1 score = 82.73%
- Specificity = 74.98%
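A minimal sketch of the K tuning, assuming scikit-learn's `GridSearchCV` as the K-fold mechanism (a manual `cross_val_score` loop would work equally well); the search range and fold count are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 5-fold CV over odd values of K; the candidate range is an assumption.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 32, 2))},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("best K:", grid.best_params_["n_neighbors"])
report(y_test, grid.best_estimator_.predict(X_test))
```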
- AdaBoost Classifier
K-Fold Cross Validation was used to tune the value of n_estimators.
The metrics for AdaBoost are as follows (a tuning sketch appears after the list):
- Accuracy = 81.93%
- Precision = 81.79%
- Recall = 82.15%
- F1 score = 81.97%
- Specificity = 81.70%
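A minimal sketch of the n_estimators tuning, following the same GridSearchCV pattern as the KNN sketch; the candidate grid is an assumption:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# 5-fold CV over candidate ensemble sizes; the grid is an assumption.
grid = GridSearchCV(AdaBoostClassifier(random_state=42),
                    param_grid={"n_estimators": [50, 100, 200, 400]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("best n_estimators:", grid.best_params_["n_estimators"])
report(y_test, grid.best_estimator_.predict(X_test))
```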
- Random Forest Classifier
K-Fold Cross Validation was used to tune the value of n_estimators.
The metrics for Random Forest are as follows (a tuning sketch appears after the list):
- Accuracy = 86.44%
- Precision = 84.88%
- Recall = 88.68%
- F1 score = 86.74%
- Specificity = 84.20%
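A minimal sketch, again using the GridSearchCV pattern from the sketches above; the candidate grid is an assumption:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 5-fold CV over candidate forest sizes; the grid is an assumption.
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 200, 400, 800]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("best n_estimators:", grid.best_params_["n_estimators"])
report(y_test, grid.best_estimator_.predict(X_test))
```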
- PyTorch Double-Layer ANN
The architecture of the neural network is shown in the following block:
```python
import torch.nn as nn

class BinaryClassification(nn.Module):
    def __init__(self, neurons1, neurons2):
        super(BinaryClassification, self).__init__()
        # 10 input features -> two hidden layers -> 1 output logit
        self.layer_1 = nn.Linear(10, neurons1)
        self.layer_2 = nn.Linear(neurons1, neurons2)
        self.layer_out = nn.Linear(neurons2, 1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.1)
        self.batchnorm1 = nn.BatchNorm1d(neurons1)
        self.batchnorm2 = nn.BatchNorm1d(neurons2)

    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.batchnorm1(x)
        x = self.relu(self.layer_2(x))
        x = self.batchnorm2(x)
        x = self.dropout(x)
        x = self.layer_out(x)  # raw logit; pair with BCEWithLogitsLoss
        return x
```
K-Fold Cross Validation was used to tune the number of neurons in the first and second hidden layers.
The metrics for the ANN are as follows (a training sketch appears after the list):
- Accuracy = 85.74%
- Precision = 89.27%
- Recall = 81.26%
- F1 score = 85.08%
- Specificity = 90.23%
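A minimal sketch of training and evaluating the network, reusing the scaled splits and the `report` helper from the sketches above. The hidden-layer sizes, optimizer, learning rate, epoch count, and full-batch training are illustrative assumptions; since the model emits raw logits, `BCEWithLogitsLoss` is the natural pairing:

```python
import torch
from torch import optim

X_tr = torch.tensor(X_train, dtype=torch.float32)
y_tr = torch.tensor(y_train.to_numpy(), dtype=torch.float32).unsqueeze(1)
X_te = torch.tensor(X_test, dtype=torch.float32)

model = BinaryClassification(neurons1=64, neurons2=32)  # sizes are assumptions
criterion = nn.BCEWithLogitsLoss()  # expects raw logits from layer_out
optimizer = optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(100):  # epoch count is an assumption
    optimizer.zero_grad()
    loss = criterion(model(X_tr), y_tr)
    loss.backward()
    optimizer.step()

# Evaluate: sigmoid the logits and threshold at 0.5.
model.eval()
with torch.no_grad():
    preds = (torch.sigmoid(model(X_te)) > 0.5).int().squeeze(1).numpy()
report(y_test, preds)
```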