avouros/clustering-workplace

This code was written for quick experimentation with different clustering techniques.

TODO:

  • Write a wiki!

  • Write some tests.

  • Convert the rest of the GUIDE code to App Designer.

  • Release full version.

Notes

  • To use this code, the Statistics and Machine Learning Toolbox for MATLAB is required.

  • For Hartigan and Wong's K-Means, the NAG Toolbox for MATLAB is required.

  • Files contain further information about relevant studies and links to online material used for the various formulas.

  • All the R packages/code used for the MATLAB implementation are under the GPLv3 license.

Contents

Clustering initialization

  • Random points: pick K random datapoints as the initial centroids (see the sketch after this list).
  • First points: pick the first K datapoints of the dataset.
  • K-Means++: pick datapoints that are far away from each other.
  • ROBIN(S): pick datapoints that are far away from each other and also lie in dense regions of the feature space; density is computed using the LOF score.
  • ROBIN(D): the original deterministic version of ROBIN.
  • Kaufman: pick datapoints that are far away from each other and close to dense regions of the feature space.
  • Density K-Means++: similar to ROBIN but deterministic; density is estimated with a different statistic based on minimum spanning trees.
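
Each of these initialisers ultimately produces a K-by-d matrix of starting centroids. The snippet below is a minimal sketch of how such a matrix (built here with the Random points strategy) can be fed to MATLAB's built-in kmeans; variable names are illustrative and not the repository's API.

```matlab
% Minimal sketch: any initialiser above reduces to building a K-by-d matrix
% of starting centroids, which kmeans accepts through the 'Start' option.
rng(1);                                   % reproducibility
X = [randn(50,2); randn(50,2) + 4];       % toy dataset: two separated blobs
K = 2;

% "Random points" initialisation: pick K random datapoints as centroids
C0 = X(randperm(size(X,1), K), :);

% Run Lloyd's K-Means from those seeds (Statistics and Machine Learning Toolbox)
[labels, C] = kmeans(X, K, 'Start', C0);
```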

Clustering algorithms

  • K-Means (Lloyd): the standard K-Means algorithm.
  • K-Means (Hartigan-Wong): available only with the NAG Toolbox for MATLAB.
  • K-Medians: similar to K-Means but uses the median instead of the mean to update the centroids (see the sketch after this list).
  • Sparse K-Means: K-Means with a feature selection and assessment mechanism.
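
To illustrate the median-based update mentioned for K-Medians, here is a minimal Lloyd-style sketch; it is not the repository's implementation, and the assignment step is assumed to use the Manhattan distance.

```matlab
% Minimal Lloyd-style K-Medians sketch (illustrative only):
% assignment uses the Manhattan distance and the update step uses the
% coordinate-wise median instead of the mean.
function [labels, C] = kmedians_sketch(X, K, maxIter)
    C = X(randperm(size(X,1), K), :);          % "Random points" initialisation
    for it = 1:maxIter
        D = pdist2(X, C, 'cityblock');         % distances to current centroids
        [~, labels] = min(D, [], 2);           % nearest-centroid assignment
        Cnew = C;
        for k = 1:K
            if any(labels == k)
                Cnew(k,:) = median(X(labels == k, :), 1);  % median update
            end
        end
        if isequal(Cnew, C), break; end        % converged
        C = Cnew;
    end
end
```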

External clustering validation

  • entropy
  • purity
  • F-score
  • accuracy
  • recall
  • specificity
  • precision
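
As a toy example of how a couple of these scores are obtained from a contingency table of true classes versus cluster labels (the repository's functions may aggregate differently), consider:

```matlab
% Toy external validation: purity and entropy from a clusters-by-classes
% contingency table. Variable names are illustrative only.
truth  = [1 1 1 1 2 2 2 3 3 3];        % ground-truth classes
labels = [1 1 2 1 2 2 2 3 3 1];        % predicted cluster labels
N = numel(truth);
T = crosstab(labels, truth);           % clusters x classes counts

purity = sum(max(T, [], 2)) / N;       % fraction of points in their cluster's majority class

P = T ./ sum(T, 2);                    % per-cluster class proportions
H = -sum(P .* log2(P), 2, 'omitnan');  % per-cluster entropy (0*log2(0) treated as 0)
entropyScore = sum((sum(T, 2) / N) .* H);   % size-weighted average over clusters
```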

Internal clustering validation

  • DaviesBouldinIndex (DBi)
  • BanfieldRafteryIndex (BRi)
  • CalinskiHarabaszIndex (CHi)
  • Silhouette (Silh2 and Silh): Silh2 is computed as the mean silhouette over all datapoints; Silh is computed as the mean of the per-cluster mean silhouettes (see the sketch below).

Note: In the case of Sparse K-Means, indices prefixed with 'w' (e.g. wSilh2) are computed on the weighted dataset. For the rest of the algorithms, indices with 'w' have the same value as the ones without (e.g. wSilh2 = Silh2).
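
The two silhouette aggregations can be written compactly with MATLAB's silhouette function; the snippet below is a sketch with illustrative variable names, not the repository's code.

```matlab
% Silh2 vs Silh as described above (Statistics and Machine Learning Toolbox).
rng(1);
X = [randn(50,2); randn(50,2) + 4];
labels = kmeans(X, 2);

s = silhouette(X, labels);                       % per-datapoint silhouette values
Silh2 = mean(s);                                 % mean over all datapoints
Silh  = mean(accumarray(labels, s, [], @mean));  % mean of per-cluster means
```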

Relevant publications

Vouros, A., Langdell, S., Croucher, M., & Vasilaki, E. (2019). An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations. arXiv preprint arXiv:1908.09946.

Citations for software and datasets that we have used in this project

Datasets

Clustering basic benchmark:

Fränti, Pasi, and Sami Sieranoja. "K-means properties on six clustering benchmark datasets." Applied Intelligence 48.12 (2018): 4743-4759.

Real datasets:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Gap models:

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. "Estimating the number of clusters in a data set via the gap statistic." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.

Weighted Gap models (YanYe):

Yan, Mingjin, and Keying Ye. "Determining the number of clusters using the weighted gap statistic." Biometrics 63.4 (2007): 1031-1037.

Brodinova dataset generator:

Brodinová, Šárka, et al. "Robust and sparse k-means clustering for high-dimensional data." Advances in Data Analysis and Classification (2017): 1-28.

MATLAB code was based on the R implementation of the algorithm; package: wrsk

Mixed dataset models:

Vouros, Avgoustinos, et al. "An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations." arXiv preprint arXiv:1908.09946 (2019).

Clustering Algorithms

K-Means (Lloyd and Hartigan-Wong):

MATLAB's and Python's default K-Means clustering is Lloyd's K-Means (initialized with the K-Means++ method), while R uses Hartigan-Wong's K-Means. For more information about these two algorithms refer to Slonim, N., Aharoni, E., & Crammer, K. (2013, June). Hartigan's K-Means Versus Lloyd's K-Means: Is It Time for a Change? In Twenty-Third International Joint Conference on Artificial Intelligence; for a comparison see Vouros, Avgoustinos, et al. "An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations." arXiv preprint arXiv:1908.09946 (2019). Here we use the NAG Toolbox for MATLAB implementation of Hartigan and Wong's K-Means, so that toolbox is required in order to run this algorithm.
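
For reference, MATLAB's built-in kmeans corresponds to the Lloyd-plus-K-Means++ default mentioned above; Hartigan-Wong is not part of the built-in function, which is why the NAG Toolbox is needed for it. A minimal call (with illustrative data) looks like:

```matlab
% MATLAB's kmeans implements Lloyd's algorithm; in recent releases its
% default 'Start' is 'plus' (K-Means++ seeding), made explicit here.
X = [randn(100,2); randn(100,2) + 5];
[labels, C] = kmeans(X, 2, 'Start', 'plus', 'Replicates', 5);
```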

Sparse K-Means:

Witten, Daniela M., and Robert Tibshirani. "A framework for feature selection in clustering." Journal of the American Statistical Association 105.490 (2010): 713-726.

MATLAB code was based on the R implementation of the algorithm; package: sparcl

Clustering Initialization

Random points and First points:

These old K-Means initialization methods are described in MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, No. 14, pp. 281-297). Random points just picks K random points of the dataset as initial centroids; First points just selects the first K points of the dataset as initial centroids.
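
Both strategies are one-liners; the sketch below uses illustrative names (X is an n-by-d data matrix, K the number of clusters) and is not the repository's code.

```matlab
% Random points: K random datapoints as initial centroids
C0_random = X(randperm(size(X,1), K), :);

% First points: the first K datapoints as initial centroids
C0_first = X(1:K, :);
```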

K-Means++:

The MATLAB implementation was based on the instructions in the MSDN Magazine blog post: Test Run - K-Means++ Data Clustering.
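
For orientation, a minimal K-Means++ seeding sketch is given below (illustrative only, not the repository's implementation): the first centroid is a uniformly random datapoint, and each subsequent centroid is drawn with probability proportional to its squared distance to the nearest centroid chosen so far.

```matlab
% Minimal K-Means++ (D^2) seeding sketch.
function C = kmeanspp_sketch(X, K)
    n = size(X, 1);
    C = X(randi(n), :);                      % first centroid: uniform at random
    for k = 2:K
        D2 = min(pdist2(X, C).^2, [], 2);    % squared distance to nearest chosen centroid
        p  = D2 / sum(D2);                   % D^2 sampling distribution
        idx = find(rand <= cumsum(p), 1);    % inverse-CDF draw from p
        C = [C; X(idx, :)];                  %#ok<AGROW> append new centroid
    end
end
```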

ROBIN:

MATLAB code was originally based on the R implementation of the algorithm; package: wrsk

Kaufman:

MATLAB implementation was based on the pseudocode of Pena, J. M., Lozano, J. A., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern recognition letters, 20(10), 1027-1040.

Density K-Means++:

Nidheesh, N., KA Abdul Nazeer, and P. M. Ameer. "An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data." Computers in biology and medicine 91 (2017): 213-221.

MATLAB code was based on the R implementation of the algorithm; code: dkmpp_0.1.0

About

A GUI for running different K-Means clustering techniques on benchmark datasets.
