avouros/clustering-workplace

This code was written for quick experimentation with different clustering techniques.

TODO:

  • Write a wiki!

  • Write some tests.

  • Convert the rest of the GUIDE code to App Designer.

  • Release full version.

Notes

  • To use this code, the Statistics and Machine Learning Toolbox for MATLAB is required.

  • For Hartigan and Wong's K-Means, the NAG Toolbox for MATLAB is required.

  • Files contain further information about relevant studies and links to online material used for the various formulas.

  • All the R packages/code used for the MATLAB implementation are under the GPLv3 license.

Contents

Clustering initialization

  • Random points: pick K random datapoints as the initial centroids (see the sketch after this list).
  • First points: pick the first K datapoints of the dataset.
  • K-Means++: pick datapoints that are far away from each other.
  • ROBIN(S): pick datapoints that are far away from each other and also lie in dense regions of the feature space; density is computed using the LOF score.
  • ROBIN(D): the original deterministic version of ROBIN.
  • Kaufman: pick datapoints that are far away from each other and close to dense regions of the feature space.
  • Density K-Means++: similar to ROBIN but deterministic; density is estimated with a different statistic based on minimum spanning trees.
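
Each of these initialisers ultimately produces a K-by-d matrix of starting centroids. The snippet below is a minimal sketch of how such a matrix (built here with the Random points strategy) can be fed to MATLAB's built-in kmeans; variable names are illustrative and not the repository's API.

```matlab
% Minimal sketch: any initialiser above reduces to building a K-by-d matrix
% of starting centroids, which kmeans accepts through the 'Start' option.
rng(1);                                   % reproducibility
X = [randn(50,2); randn(50,2) + 4];       % toy dataset: two separated blobs
K = 2;

% "Random points" initialisation: pick K random datapoints as centroids
C0 = X(randperm(size(X,1), K), :);

% Run Lloyd's K-Means from those seeds (Statistics and Machine Learning Toolbox)
[labels, C] = kmeans(X, K, 'Start', C0);
```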

Clustering algorithms

  • K-Means (Lloyd): the standard K-Means algorithm.
  • K-Means (Hartigan-Wong): available only with the NAG Toolbox for MATLAB.
  • K-Medians: similar to K-Means but uses the median instead of the mean to update the centroids (see the sketch after this list).
  • Sparse K-Means: K-Means with a feature selection and assessment mechanism.
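
To illustrate the median-based update mentioned for K-Medians, here is a minimal Lloyd-style sketch; it is not the repository's implementation, and the assignment step is assumed to use the Manhattan distance.

```matlab
% Minimal Lloyd-style K-Medians sketch (illustrative only):
% assignment uses the Manhattan distance and the update step uses the
% coordinate-wise median instead of the mean.
function [labels, C] = kmedians_sketch(X, K, maxIter)
    C = X(randperm(size(X,1), K), :);          % "Random points" initialisation
    for it = 1:maxIter
        D = pdist2(X, C, 'cityblock');         % distances to current centroids
        [~, labels] = min(D, [], 2);           % nearest-centroid assignment
        Cnew = C;
        for k = 1:K
            if any(labels == k)
                Cnew(k,:) = median(X(labels == k, :), 1);  % median update
            end
        end
        if isequal(Cnew, C), break; end        % converged
        C = Cnew;
    end
end
```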

External clustering validation

  • entropy
  • purity
  • F-score
  • accuracy
  • recall
  • specificity
  • precision
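
As a toy example of how a couple of these scores are obtained from a contingency table of true classes versus cluster labels (the repository's functions may aggregate differently), consider:

```matlab
% Toy external validation: purity and entropy from a clusters-by-classes
% contingency table. Variable names are illustrative only.
truth  = [1 1 1 1 2 2 2 3 3 3];        % ground-truth classes
labels = [1 1 2 1 2 2 2 3 3 1];        % predicted cluster labels
N = numel(truth);
T = crosstab(labels, truth);           % clusters x classes counts

purity = sum(max(T, [], 2)) / N;       % fraction of points in their cluster's majority class

P = T ./ sum(T, 2);                    % per-cluster class proportions
H = -sum(P .* log2(P), 2, 'omitnan');  % per-cluster entropy (0*log2(0) treated as 0)
entropyScore = sum((sum(T, 2) / N) .* H);   % size-weighted average over clusters
```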

Internal clustering validation

  • DaviesBouldinIndex (DBi)
  • BanfieldRafteryIndex (BRi)
  • CalinskiHarabaszIndex (CHi)
  • Silhouette (Silh2 and Silh): Silh2 is computed as the mean silhouette over all datapoints; Silh is computed as the mean of the per-cluster mean silhouettes (see the sketch below).

Note: In the case of Sparse K-Means, indices prefixed with 'w' (e.g. wSilh2) are computed on the weighted dataset. For the rest of the algorithms, indices with 'w' have the same value as the ones without (e.g. wSilh2 = Silh2).
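
The two silhouette aggregations can be written compactly with MATLAB's silhouette function; the snippet below is a sketch with illustrative variable names, not the repository's code.

```matlab
% Silh2 vs Silh as described above (Statistics and Machine Learning Toolbox).
rng(1);
X = [randn(50,2); randn(50,2) + 4];
labels = kmeans(X, 2);

s = silhouette(X, labels);                       % per-datapoint silhouette values
Silh2 = mean(s);                                 % mean over all datapoints
Silh  = mean(accumarray(labels, s, [], @mean));  % mean of per-cluster means
```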

Relevant publications

Vouros, A., Langdell, S., Croucher, M., & Vasilaki, E. (2019). An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations. arXiv preprint arXiv:1908.09946.

Citations for software and datasets that we have used in this project

Datasets

Clustering basic benchmark:

Fränti, Pasi, and Sami Sieranoja. "K-means properties on six clustering benchmark datasets." Applied Intelligence 48.12 (2018): 4743-4759.

Real datasets:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Gap models:

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. "Estimating the number of clusters in a data set via the gap statistic." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.

Weighted Gap models (YanYe):

Yan, Mingjin, and Keying Ye. "Determining the number of clusters using the weighted gap statistic." Biometrics 63.4 (2007): 1031-1037.

Brodinova dataset generator:

Brodinová, Šárka, et al. "Robust and sparse k-means clustering for high-dimensional data." Advances in Data Analysis and Classification (2017): 1-28.

MATLAB code was based on the R implementation of the algorithm; package: wrsk

Mixed dataset models:

Vouros, Avgoustinos, et al. "An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations." arXiv preprint arXiv:1908.09946 (2019).

Clustering Algorithms

K-Means (Lloyd and Hartigan-Wong):

MATLAB's and Python's default K-Means clustering is Lloyd's K-Means (initialized with the K-Means++ method), while R uses Hartigan-Wong's K-Means. For more information about these two algorithms refer to Slonim, N., Aharoni, E., & Crammer, K. (2013, June). Hartigan's K-Means Versus Lloyd's K-Means: Is It Time for a Change? In Twenty-Third International Joint Conference on Artificial Intelligence; for a comparison see Vouros, Avgoustinos, et al. "An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations." arXiv preprint arXiv:1908.09946 (2019). Here we use the NAG Toolbox for MATLAB implementation of Hartigan and Wong's K-Means, so that toolbox is required in order to run this algorithm.
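
For reference, MATLAB's built-in kmeans corresponds to the Lloyd-plus-K-Means++ default mentioned above; Hartigan-Wong is not part of the built-in function, which is why the NAG Toolbox is needed for it. A minimal call (with illustrative data) looks like:

```matlab
% MATLAB's kmeans implements Lloyd's algorithm; in recent releases its
% default 'Start' is 'plus' (K-Means++ seeding), made explicit here.
X = [randn(100,2); randn(100,2) + 5];
[labels, C] = kmeans(X, 2, 'Start', 'plus', 'Replicates', 5);
```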

Sparse K-Means:

Witten, Daniela M., and Robert Tibshirani. "A framework for feature selection in clustering." Journal of the American Statistical Association 105.490 (2010): 713-726.

MATLAB code was based on the R implementation of the algorithm; package: sparcl

Clustering Initialization

Random points and First points:

These old K-Means initialization methods are described in MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, No. 14, pp. 281-297). Random points just picks K random points of the dataset as initial centroids; First points just selects the first K points of the dataset as initial centroids.
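
Both strategies are one-liners; the sketch below uses illustrative names (X is an n-by-d data matrix, K the number of clusters) and is not the repository's code.

```matlab
% Random points: K random datapoints as initial centroids
C0_random = X(randperm(size(X,1), K), :);

% First points: the first K datapoints as initial centroids
C0_first = X(1:K, :);
```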

K-Means++:

The MATLAB implementation was based on the instructions in the MSDN Magazine blog post: Test Run - K-Means++ Data Clustering.
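
For orientation, a minimal K-Means++ seeding sketch is given below (illustrative only, not the repository's implementation): the first centroid is a uniformly random datapoint, and each subsequent centroid is drawn with probability proportional to its squared distance to the nearest centroid chosen so far.

```matlab
% Minimal K-Means++ (D^2) seeding sketch.
function C = kmeanspp_sketch(X, K)
    n = size(X, 1);
    C = X(randi(n), :);                      % first centroid: uniform at random
    for k = 2:K
        D2 = min(pdist2(X, C).^2, [], 2);    % squared distance to nearest chosen centroid
        p  = D2 / sum(D2);                   % D^2 sampling distribution
        idx = find(rand <= cumsum(p), 1);    % inverse-CDF draw from p
        C = [C; X(idx, :)];                  %#ok<AGROW> append new centroid
    end
end
```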

ROBIN:

MATLAB code was originally based on the R implementation of the algorithm; package: wrsk

Kaufman:

MATLAB implementation was based on the pseudocode of Pena, J. M., Lozano, J. A., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern recognition letters, 20(10), 1027-1040.

Density K-Means++:

Nidheesh, N., KA Abdul Nazeer, and P. M. Ameer. "An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data." Computers in biology and medicine 91 (2017): 213-221.

MATLAB code was based on the R implementation of the algorithm; code: dkmpp_0.1.0

About

A GUI for running different K-Means clustering techniques on benchmark datasets.
