naive k-means clustering from scratch in vanilla python

┌─────────────────────────────────────────────┐
| can someone please tell me what 'k' means?  |
└─────────────────────────────────────────────┘
    \
     \
      \       _.-.
         __.-' ,  \
        '--'-'._   \
                '.  \
                 )-- \ __.--._
                /   .'        '--.
               .               _/-._
               :       ____._/   _-'
               '._          _.'-'
                  '-._    _.'
                      \_|/
                     __|||
                     >__/'

k-means clustering is an unsupervised learning algorithm, meaning it works with unlabeled data. it's a partitioning method: every data point belongs to exactly one cluster. finding the globally optimal clustering is np-hard, so in practice we use iterative refinement (lloyd's algorithm) to get a good approximation. for this example we'll use the classic iris dataset, even though k-means often struggles on it: two of the three species (versicolor and virginica) overlap in feature space, so no distance-based partition separates them cleanly. since the dataset has three flower species, we'll aim for three clusters (k=3). finally, we'll compare our results to the standard k-means implementation in scikit-learn.
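
before writing anything from scratch, here's the setup as a minimal sketch, assuming scikit-learn is installed (we need it anyway for the baseline comparison). `load_iris` and `KMeans` are the standard scikit-learn entry points; the `random_state` value is arbitrary.

```python
# load the iris dataset (150 samples, 4 features, 3 species) and fit the
# scikit-learn baseline we'll compare our from-scratch version against.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
points = iris.data.tolist()  # plain python lists, for the from-scratch version below

# n_init=10 reruns k-means with 10 different initializations and keeps the best
baseline = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)
print(baseline.labels_[:10])  # cluster index assigned to each sample
```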

algorithm:

  1. select the number of clusters, $k$.

  2. place $k$ random vectors, called "centroids", in the space of the data points.

    k-means is sensitive to the initial placement of the centroids. the "forgy method" (pick $k$ random data points as the initial centroids) is simple and effective, but "k-means++" (spread the centroids out by sampling each new one proportional to its squared distance from those already chosen) is the more sophisticated approach; sketches of both follow the list.

  3. calculate the distance between every data point and every centroid (typically the euclidean distance).

  4. assign each data point to the cluster of the closest centroid, forming $k$ clusters.

  5. update the position of each centroid: set it to the mean of the data points assigned to its cluster.

  6. repeat steps 3-5 until the centroids stop moving (or some other stopping criterion is met, e.g. a maximum number of iterations).

    avoid bad local minima by running the algorithm multiple times with different initializations and keeping the run with the lowest total squared distance (see `best_of` in the sketch after this list).
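
putting the numbered steps together, here is one way the from-scratch version can look in vanilla python. this is a sketch using only the standard library; the function names and the `tol` / `max_iter` defaults are choices made here, not a fixed api.

```python
import math
import random

def assign(points, centroids):
    # steps 3 + 4: label each point with the index of its nearest centroid
    # (math.dist computes euclidean distance, python 3.8+)
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

def update(points, labels, centroids):
    # step 5: move each centroid to the mean of the points assigned to it;
    # if a cluster ends up empty, keep its old centroid instead of dividing by zero
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)
        else:
            new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
    return new_centroids

def kmeans(points, k, tol=1e-9, max_iter=300):
    # step 2: forgy initialization, k distinct data points become the centroids
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        # step 6: stop once no centroid moved by more than tol
        if all(math.dist(old, new) <= tol for old, new in zip(centroids, new_centroids)):
            break
        centroids = new_centroids
    return centroids, labels

def inertia(points, centroids, labels):
    # total squared distance to the assigned centroids; lower is better
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

def best_of(points, k, n_init=10):
    # the note under step 6: rerun with fresh initializations, keep the best run
    runs = [kmeans(points, k) for _ in range(n_init)]
    return min(runs, key=lambda run: inertia(points, run[0], run[1]))
```

`centroids, labels = best_of(points, 3)` on the iris points from above gives labels we can hold against `baseline.labels_`. note that cluster indices are arbitrary, so compare the partitions, not the raw integers.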
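
the k-means++ note from step 2 fits in a few lines too. this sketch samples each new centroid with probability proportional to its squared distance from the nearest centroid chosen so far; swapping it in for `random.sample` in `kmeans` above gives the smarter initialization. the name `kmeanspp_init` is this sketch's own.

```python
def kmeanspp_init(points, k):
    # k-means++: first centroid uniform at random, every further centroid sampled
    # proportional to its squared distance to the nearest already-chosen centroid
    centroids = [random.choice(points)]
    while len(centroids) < k:
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(random.choices(points, weights=weights, k=1)[0])
    return centroids
```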
