The purpose of the assignment was to process the raw myopia data so that it could be fitted to machine learning models. Several clustering algorithms were used to explore whether the patients can be placed into distinct groups. Such groups could then be analyzed separately, which may help find better ways to predict myopia, or nearsightedness.
- Used a Pandas DataFrame to read `myopia.csv`.
- Removed the "MYOPIC" column from the dataset.
- Checked the data for null values and duplicate rows.
- Standardized the dataset (using StandardScaler) so that columns that contain larger values do not influence the outcome more than columns with smaller values.
- Performed dimensionality reduction with PCA. This reduced the number of features from 14 to 10.
  - Preserved 90% of the explained variance in the dimensionality reduction.
- Further reduced the dataset dimensions with t-SNE.
- Created a scatter plot of the t-SNE output. The plot appears to show 5 distinct clusters.
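
The preparation and reduction steps above can be sketched roughly as follows. This is a minimal illustration, not the assignment's actual code: the helper names `prepare_features` and `reduce_dimensions` are hypothetical, the t-SNE learning rate is an arbitrary choice, and a small synthetic table stands in for `myopia.csv` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def prepare_features(df, label_col="MYOPIC"):
    """Drop the label column and count nulls and duplicate rows."""
    features = df.drop(columns=[label_col])
    n_nulls = int(features.isnull().sum().sum())
    n_duplicates = int(features.duplicated().sum())
    return features, n_nulls, n_duplicates


def reduce_dimensions(features, variance=0.90, random_state=0):
    """Standardize, keep enough PCA components for `variance`, then run t-SNE."""
    scaled = StandardScaler().fit_transform(features)
    # A float n_components asks PCA to keep that share of explained variance.
    pca = PCA(n_components=variance, random_state=random_state)
    pca_features = pca.fit_transform(scaled)
    # learning_rate is an assumed value, not one taken from the assignment.
    tsne = TSNE(n_components=2, learning_rate=200.0, random_state=random_state)
    tsne_features = tsne.fit_transform(pca_features)
    return pca, pca_features, tsne_features


# Assumption: synthetic stand-in for the real data (14 features + label).
# With the real file this would be: demo_df = pd.read_csv("myopia.csv")
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.normal(size=(150, 14)), columns=[f"f{i}" for i in range(14)]
)
demo_df["MYOPIC"] = rng.integers(0, 2, size=150)

features, n_nulls, n_duplicates = prepare_features(demo_df)
pca, pca_features, tsne_features = reduce_dimensions(features)
print("nulls:", n_nulls, "duplicates:", n_duplicates)
print("PCA shape:", pca_features.shape, "t-SNE shape:", tsne_features.shape)
```

The two columns of `tsne_features` can be passed to a scatter plot (for example matplotlib's `plt.scatter`) to look for visible clusters. The number of PCA components kept depends on the data, so the synthetic table will not reproduce the 14-to-10 reduction reported above.
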
Created an elbow plot to identify the best number of clusters.
- Used a `for` loop to determine the inertia for each `k` from 1 through 10.
- Determined where the elbow of the plot is, and at which value of `k` it appears.
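
The inertia loop described above might look something like the sketch below. It assumes scikit-learn's `KMeans` as the clustering model (the text mentions inertia and `k` but does not name the estimator), the helper `inertia_by_k` is hypothetical, and three synthetic blobs stand in for the reduced myopia features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def inertia_by_k(data, k_values=range(1, 11), random_state=0):
    """Fit one KMeans model per k and collect its inertia."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(data)
        rows.append({"k": k, "inertia": model.inertia_})
    return pd.DataFrame(rows)


# Assumption: synthetic data with three well-separated groups,
# used only so the example runs without the real dataset.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo_points = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

elbow_df = inertia_by_k(demo_points)
print(elbow_df)
```

Plotting `inertia` against `k` gives the elbow curve; the elbow is the value of `k` after which inertia stops dropping sharply. For the real data, `demo_points` would be replaced by the PCA-reduced features.
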