This code was written for quick experimentation on different clustering techniques.
TODO:
- Write a wiki!
- Write some tests.
- Convert the rest of GUIDE to App Designer.
- Release full version.
Requirements and notes:
- The Statistics and Machine Learning Toolbox for MATLAB is required to use this code.
- Hartigan and Wong's K-Means additionally requires the NAG Toolbox for MATLAB.
- Individual files contain further information about relevant studies and links to the online material used for the various formulas.
- All the R packages/code used for the MATLAB implementations are under the GPLv3 license.
Initialization methods:
- Random points, pick K random datapoints of the dataset as initial centroids.
- First points, pick the first K datapoints of the dataset.
- K-Means++, pick datapoints away from each other.
- ROBIN(S), pick datapoints away from each other and also in dense regions of the feature space; density is computed using the LOF score.
- ROBIN(D), the original deterministic version of ROBIN.
- Kaufman, pick datapoints away from each other and close to dense regions of the feature space.
- Density K-Means++, same as ROBIN but deterministic; it estimates density with a different statistic based on minimum spanning trees.
Clustering algorithms:
- K-Means (Lloyd), the common K-Means algorithm.
- K-Means (Hartigan-Wong), available only with the NAG Toolbox for MATLAB.
- K-Medians, similar to K-Means but uses the median instead of the mean to update the centroids.
- Sparse K-Means, K-Means with a feature selection and assessment mechanism.
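To illustrate the difference between the two update rules, here is a minimal NumPy sketch (an illustration, not this repository's MATLAB code) of Lloyd's K-Means in which swapping the mean for the coordinate-wise median yields K-Medians:

```python
import numpy as np

def lloyd_kmeans(X, centroids, n_iter=100, use_median=False):
    """Sketch of Lloyd's K-Means; with use_median=True the centroid
    update uses the coordinate-wise median instead (K-Medians)."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: mean (K-Means) or median (K-Medians) per cluster;
        # an empty cluster keeps its previous centroid
        new = np.array([
            (np.median if use_median else np.mean)(X[labels == k], axis=0)
            if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

The median update makes the centroids less sensitive to outlying points, which is the usual motivation for K-Medians.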
Performance indexes:
- entropy
- purity
- F-score
- accuracy
- recall
- specificity
- precision
- DaviesBouldinIndex (DBi)
- BanfieldRafteryIndex (BRi)
- CalinskiHarabaszIndex (CHi)
- Silhouette (Silh2 and Silh), Silh2 is the mean silhouette taken over the datapoints; Silh is the mean silhouette taken over the clusters.
Note: In the case of Sparse K-Means, indexes with 'w' (e.g. wSilh2) are computed on the weighted dataset. For the rest of the algorithms, indexes with 'w' should have the same values as the ones without (e.g. wSilh2 = Silh2).
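For reference, purity and entropy have common textbook definitions; the NumPy sketch below uses those common definitions and is not taken from this code base (the repository's exact formulas may differ):

```python
import numpy as np

def purity(labels, truth):
    # fraction of points that belong to the majority class of their cluster
    total = 0
    for k in np.unique(labels):
        _, counts = np.unique(truth[labels == k], return_counts=True)
        total += counts.max()
    return total / len(labels)

def clustering_entropy(labels, truth):
    # size-weighted mean of the per-cluster class entropies (lower is better)
    N = len(labels)
    H = 0.0
    for k in np.unique(labels):
        _, counts = np.unique(truth[labels == k], return_counts=True)
        p = counts / counts.sum()
        H += counts.sum() / N * -(p * np.log2(p)).sum()
    return H
```

A perfect clustering has purity 1 and entropy 0; a cluster mixing two classes evenly contributes one full bit of entropy.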
Clustering basic benchmark:
Real datasets:
Gap models:
Weighted Gap models (YanYe):
Brodinova dataset generator:
MATLAB code was based on the R implementation of the algorithm; package: wrsk
Mixed dataset models:
K-Means (Lloyd and Hartigan-Wong):
MATLAB's and Python's default K-Means clustering is Lloyd's K-Means (initialized with the K-Means++ method), while R defaults to Hartigan and Wong's K-Means. For more information about these two algorithms refer to Slonim, N., Aharoni, E., & Crammer, K. (2013, June). Hartigan's K-Means versus Lloyd's K-Means: is it time for a change? In Twenty-Third International Joint Conference on Artificial Intelligence; for a comparison see Vouros, Avgoustinos, et al. "An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations." arXiv preprint arXiv:1908.09946 (2019). Here we use the NAG Toolbox for MATLAB implementation of Hartigan and Wong's K-Means, so the toolbox is required in order to use this algorithm.
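To make the contrast with Lloyd's batch assign-then-update loop concrete, the NumPy sketch below performs one pass of Hartigan-style single-point moves; it illustrates only the move criterion and is not the NAG implementation. A point x moves from its cluster i to another cluster j when the SSE increase of adding it to j, n_j/(n_j+1)*||x-c_j||^2, is smaller than the SSE decrease of removing it from i, n_i/(n_i-1)*||x-c_i||^2 (the n/(n±1) factors account for the centroid shift caused by the move):

```python
import numpy as np

def hartigan_sweep(X, labels, k):
    """One pass of Hartigan-style single-point moves (illustrative
    sketch, not the NAG implementation)."""
    labels = labels.copy()
    for idx in range(len(X)):
        i = labels[idx]
        n_i = np.sum(labels == i)
        if n_i <= 1:
            continue  # never empty a cluster by moving its last point
        c_i = X[labels == i].mean(axis=0)  # centroid including X[idx]
        # SSE decrease from removing X[idx] from its current cluster
        best_j = i
        best_cost = n_i / (n_i - 1) * np.sum((X[idx] - c_i) ** 2)
        for j in range(k):
            n_j = np.sum(labels == j)
            if j == i or n_j == 0:
                continue
            c_j = X[labels == j].mean(axis=0)
            # SSE increase from adding X[idx] to cluster j
            cost = n_j / (n_j + 1) * np.sum((X[idx] - c_j) ** 2)
            if cost < best_cost:
                best_j, best_cost = j, cost
        labels[idx] = best_j
    return labels
```

In the full algorithm such sweeps repeat until no point wants to move, which can escape some local minima that Lloyd's batch updates get stuck in.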
Sparse K-Means:
MATLAB code was based on the R implementation of the algorithm; package: sparcl
Random points and First points:
These early K-Means initialization methods are described in MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, No. 14, pp. 281-297). Random points simply picks K random points of the dataset as initial centroids; First points simply selects the first K points of the dataset as initial centroids.
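Both methods fit in a few lines; this NumPy sketch is an illustration, not the repository's MATLAB code:

```python
import numpy as np

def first_points_init(X, k):
    # First points: the first k rows of the dataset become the centroids
    return X[:k].copy()

def random_points_init(X, k, seed=None):
    # Random points: k distinct rows chosen uniformly at random
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].copy()
```

First points is deterministic (and therefore sensitive to the row order of the dataset), while Random points gives a different seeding on every run unless the seed is fixed.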
K-Means++:
MATLAB implementation was based on the instructions of the MSDN Magazine Blog: Test Run - K-Means++ Data Clustering
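The K-Means++ seeding idea can be sketched as follows (a NumPy illustration, not the repository's MATLAB code): the first centroid is drawn uniformly at random, and each subsequent centroid is drawn with probability proportional to the squared distance to its nearest already-chosen centroid.

```python
import numpy as np

def kmeanspp_init(X, k, seed=None):
    """K-Means++ seeding sketch (D^2 weighting)."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # draw the next centroid proportionally to those squared distances
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

The D^2 weighting makes far-away, not-yet-covered regions of the dataset much more likely to receive a centroid than points next to an existing one.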
ROBIN:
MATLAB code was originally based on the R implementation of the algorithm; package: wrsk
Kaufman:
MATLAB implementation was based on the pseudocode of Pena, J. M., Lozano, J. A., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern recognition letters, 20(10), 1027-1040.
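A NumPy sketch of that pseudocode (an illustration, not the repository's MATLAB code): the first seed is the most centrally located point, and each further seed is the point with the largest total "gain", where the gain of candidate i over point j is max(D_j - d(i, j), 0) and D_j is j's distance to its nearest already-chosen seed.

```python
import numpy as np

def kaufman_init(X, k):
    """Sketch of Kaufman seeding after Pena et al. (1999)."""
    # pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # first seed: the point with the smallest total distance to all others
    chosen = [int(np.argmin(D.sum(axis=1)))]
    while len(chosen) < k:
        Dj = D[:, chosen].min(axis=1)  # distance of each point to nearest seed
        # gain[i] = sum_j max(D_j - d(i, j), 0)
        gain = np.maximum(Dj[None, :] - D, 0).sum(axis=1)
        gain[chosen] = -1.0  # never re-pick an existing seed
        chosen.append(int(np.argmax(gain)))
    return X[chosen].copy()
```

Because the gain rewards candidates that would sit close to many currently far-away points, the seeds end up both spread out and near dense regions, and the whole procedure is deterministic.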
Density K-Means++:
MATLAB code was based on the R implementation of the algorithm; code: dkmpp_0.1.0