

KNN ALGORITHM.


About KNN classifiers.

A KNN classifier predicts the class of a given test observation by identifying the training observations that are nearest to it. Because of this, the scale of the variables in the dataset is very important: variables on a large scale have a larger effect on the distance between observations, and therefore on the KNN classifier itself.
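
To see why, consider a toy example (hypothetical numbers, not from this dataset): if one variable is measured in the thousands and another in single digits, the Euclidean distance is dominated almost entirely by the first one.

import numpy as np

# Hypothetical observations: feature 1 on a large scale, feature 2 on a small one.
a = np.array([5000.0, 1.0])
b = np.array([5100.0, 9.0])

gaps = np.abs(a - b)
print(gaps)                   # [100.   8.] -- the feature 1 gap dwarfs the feature 2 gap
print(np.linalg.norm(a - b))  # ~100.32 -- the distance is almost all feature 1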

An intuitive way to handle the scaling problem in KNN classification is to standardize the dataset so that all variables are given a mean of zero and a standard deviation of one.

Training algorithm:

  1. Store all the data

Prediction Algorithm:

  1. Calculate the distance from x to all points in your data.
  2. Sort the points in your data by increasing distance from x.
  3. Predict the majority label of the "k" closest points.
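
Putting the two together, here is a minimal from-scratch sketch of the procedure (the function and variable names are illustrative, not from the project code):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # 1. Calculate the distance from x to every stored training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    # 2. Sort the points by increasing distance and keep the k closest.
    nearest = np.argsort(distances)[:k]
    # 3. Predict the majority label of the k closest points.
    return Counter(y_train[nearest]).most_common(1)[0][0]

In practice the columns of X_train (and x) would be standardized first, which is exactly what StandardScaler is used for later in this project.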

About this project.

This project aims to classify observations with respect to a target variable given as the last column. It is important to note that this is one of the anonymized datasets provided by clients, presumably because of the need to protect sensitive information.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
data = pd.read_csv('annonimizeddataset',index_col = 0)
data.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.913917 1.162073 0.567946 0.755464 0.780862 0.352608 0.759697 0.643798 0.879422 1.231409 1
1 0.635632 1.003722 0.535342 0.825645 0.924109 0.648450 0.675334 1.013546 0.621552 1.492702 0
2 0.721360 1.201493 0.921990 0.855595 1.526629 0.720781 1.626351 1.154483 0.957877 1.285597 0
3 1.234204 1.386726 0.653046 0.825624 1.142504 0.875128 1.409708 1.380003 1.522692 1.153093 1
4 1.279491 0.949750 0.627280 0.668976 1.232537 0.703727 1.115596 0.646691 1.463812 1.419167 1

So the data is anonymized, with meaningless labels as the column names. The last column is the target class, which needs to be predicted.

Exploratory data analysis

data.columns
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')
data.shape
(1000, 11)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 11 columns):
WTT             1000 non-null float64
PTI             1000 non-null float64
EQW             1000 non-null float64
SBI             1000 non-null float64
LQE             1000 non-null float64
QWG             1000 non-null float64
FDJ             1000 non-null float64
PJF             1000 non-null float64
HQE             1000 non-null float64
NXJ             1000 non-null float64
TARGET CLASS    1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 93.8 KB
sns.heatmap(data.isnull(),yticklabels=False,cbar=False)
(figure: heatmap of missing values in the dataset)

The heatmap above shows clearly that there is no missing data in the dataset.
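
An equivalent non-graphical check (a one-liner, not in the original notebook) is to count the nulls per column directly:

data.isnull().sum()   # prints 0 for every column if nothing is missing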

Scaling Variables.

As pointed out earlier, scaling the variables is very important in KNN.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data.drop('TARGET CLASS',axis = 1))
StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_feat = scaler.transform(data.drop('TARGET CLASS',axis = 1))
data_feat = pd.DataFrame(scaled_feat,columns=data.columns[:-1])
data_feat.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
0 -0.123542 0.185907 -0.913431 0.319629 -1.033637 -2.308375 -0.798951 -1.482368 -0.949719 -0.643314
1 -1.084836 -0.430348 -1.025313 0.625388 -0.444847 -1.152706 -1.129797 -0.202240 -1.828051 0.636759
2 -0.788702 0.339318 0.301511 0.755873 2.031693 -0.870156 2.599818 0.285707 -0.682494 -0.377850
3 0.982841 1.060193 -0.621399 0.625299 0.452820 -0.267220 1.750208 1.066491 1.241325 -1.026987
4 1.139275 -0.640392 -0.709819 -0.057175 0.822886 -0.936773 0.596782 -1.472352 1.040772 0.276510
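
As a quick sanity check (not part of the original notebook), the scaled features should now have a mean of roughly zero and a standard deviation of roughly one:

print(data_feat.mean().round(2))   # ~0.0 for every column
print(data_feat.std().round(2))    # ~1.0 for every column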

Splitting data into training and test sets

from sklearn.model_selection import train_test_split
X = data_feat
y = data['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Fitting the KNN model.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')
predictions = knn.predict(X_test)

Model Evaluation

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))
[[134   8]
 [ 11 147]]
____________________________________________________________
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       142
           1       0.95      0.93      0.94       158

   micro avg       0.94      0.94      0.94       300
   macro avg       0.94      0.94      0.94       300
weighted avg       0.94      0.94      0.94       300

This gives an accuracy of (134 + 147) / 300 ≈ 94%.
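
The same figure can be read off directly with sklearn (a one-line alternative, not in the original notebook):

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))   # ~0.94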

Using the elbow method to improve the model.

This process aims to extract more information by choosing a better k value. It iterates over many different k values and plots their error rates, which makes it easy to see which one has the lowest error rate.

errorRate = []

for kvalue in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=kvalue)
    knn.fit(X_train,y_train)
    predictions = knn.predict(X_test)
    errorRate.append(np.mean(predictions != y_test)) # average error rate
plt.figure(figsize=(10,6))
plt.plot(range(1,40),errorRate,color = "blue",linestyle = "dashed",marker = 'o')
(figure: error rate for K = 1 to 39, dashed blue line with circle markers)
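
Reading the lowest point off the plot is the manual step; the same choice can also be made programmatically from the errorRate list (a small sketch, assuming the list above is still in scope):

best_k = int(np.argmin(errorRate)) + 1   # +1 because the loop started at k = 1
print(best_k, errorRate[best_k - 1])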

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)   # re-predict with the new k before evaluating
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))

This gives a small improvement in accuracy.
