avouros/Code-KMeans-benchmark

For the latest updates, as well as for a standalone application capable of reproducing this work, please refer to the PhD-additions branch of this repository.

This repository contains the code used to produce the results of the manuscript: An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations.

Generate manuscript results

To reproduce the results of the paper, run nag_init_methods_generate.m, which produces the folder NAG_init_res. A subfolder is created for each clustering algorithm and populated with files named [modelName]_model_[setNo].mat. Each of these files contains a struct array (rows: initialisation methods; columns: data sets, default: 40). For stochastic methods, multiple values are available, one per repetition of the clustering solution (default: 50 repetitions); a short loading sketch follows the field list below.

Struct fields:

  • centers: initial clustering centroids.
  • idx: cluster index of each data point.
  • centroids: final clustering centroids.
  • iterations: number of iterations until convergence (not for Hartigan-Wong).
  • Silh2 & Silh: silhouette index of the clustering solution (Silh2 was used in the manuscript).
  • wcd: within-cluster distance.
  • bcd: between-cluster distance.
  • clPur: purity index of the clustering solution.
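
The following is a minimal sketch of inspecting one of these files; the subfolder name, file name, and the assumption that the .mat file holds a single variable are illustrative, not guaranteed by the repository's code:

    % Hedged sketch: load one result file and read a few fields.
    % The path below is hypothetical; substitute the algorithm subfolder and set number.
    S = load(fullfile('NAG_init_res', 'lloyd', 'lloyd_model_1.mat'));
    f = fieldnames(S);
    res = S.(f{1});               % struct array: rows = initialisation methods, columns = data sets
    entry = res(1, 1);            % first initialisation method, first data set
    disp(size(entry.centroids));  % final clustering centroids
    disp(entry.Silh2);            % silhouette index used in the manuscript
    disp(entry.clPur);            % purity index of the clustering solution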

nag_init_methods_timing.m runs the execution time analysis. It requires the folder NAG_init_res generated by nag_init_methods_generate.m.
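
A typical session therefore runs the two scripts in this order (the script names come from the repository; everything else here is assumed):

    % Generate the clustering results first, then run the timing analysis.
    nag_init_methods_generate;   % creates and populates ./NAG_init_res
    nag_init_methods_timing;     % execution-time analysis, reads from ./NAG_init_res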

The results of nag_init_methods_generate.m and nag_init_methods_timing.m were then used to generate the figures and tables of the manuscript. For the results of a previous version of the manuscript (arXiv: https://arxiv.org/abs/1908.09946), plotting code is provided; refer to commit 9e9a00

Current dependencies:

The results were produced using MATLAB R2018b with the following toolboxes:

  1. Statistics and Machine Learning Toolbox
  2. Parallel Computing Toolbox (if this toolbox is not available, execution is still possible; see the sketch after this list).
  3. The NAG Toolbox for MATLAB, to run Hartigan and Wong's K-Means.
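
As a rough illustration of the optional parallel dependency, a check along the following lines could be used before opening a pool; this sketch is an assumption, not code from the repository:

    % Hedged sketch: open a parallel pool only when the Parallel Computing Toolbox is present.
    hasParallel = license('test', 'Distrib_Computing_Toolbox') && ~isempty(ver('parallel'));
    if hasParallel && isempty(gcp('nocreate'))
        parpool;                 % default parallel pool
    end
    % Without the toolbox, parfor loops simply run in serial, so execution remains possible.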

Citations for software and datasets that we have used in this project

Datasets

Clustering basic benchmark:

Fränti, Pasi, and Sami Sieranoja. "K-means properties on six clustering benchmark datasets." Applied Intelligence 48.12 (2018): 4743-4759.

Gap models:

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. "Estimating the number of clusters in a data set via the gap statistic." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.

Weighted Gap models (YanYe):

Yan, Mingjin, and Keying Ye. "Determining the number of clusters using the weighted gap statistic." Biometrics 63.4 (2007): 1031-1037.

Brodinova dataset generator:

Brodinová, Šárka, et al. "Robust and sparse k-means clustering for high-dimensional data." Advances in Data Analysis and Classification (2017): 1-28.

MATLAB code was based on the R implementation of the algorithm; package: wrsk

Real datasets

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Clustering Algorithms

Hartigan-Wong K-Means:

MATLAB's and Python's default K-Means clustering is Lloyd's K-Means (initialised with the K-Means++ method), while R uses Hartigan-Wong's K-Means. Here we use the NAG Toolbox for MATLAB implementation of Hartigan and Wong's K-Means, so the toolbox is required in order to run this algorithm.
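
For contrast only (this is not how the Hartigan-Wong results are produced), MATLAB's built-in kmeans runs Lloyd's algorithm with K-Means++ seeding by default:

    % MATLAB's built-in Lloyd's K-Means with K-Means++ seeding (the defaults);
    % the Hartigan-Wong runs in this repository use the NAG Toolbox instead.
    X = rand(200, 2);                                     % toy data: 200 points in 2-D
    [idx, C] = kmeans(X, 3, 'Start', 'plus', 'Replicates', 5);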

Clustering Initialization

Random:

MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, No. 14, pp. 281-297).

K-Means++:

The original algorithm is described in the study of Arthur, D., & Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding, pp. 1027-1035. In SODA'07: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA.

The MATLAB implementation was based on the instructions of the MSDN Magazine blog post: Test Run - K-Means++ Data Clustering.
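
For reference, a minimal sketch of the K-Means++ seeding step (not the repository's exact implementation) looks like this:

    % Hedged sketch of K-Means++ seeding: X is an n-by-d data matrix, k the number of clusters.
    function C = kmeanspp_seed(X, k)
        n = size(X, 1);
        C = zeros(k, size(X, 2));
        C(1, :) = X(randi(n), :);                     % first centre: a uniformly random point
        D = sum((X - C(1, :)).^2, 2);                 % squared distance to the nearest chosen centre
        for j = 2:k
            p = D / sum(D);                           % sample the next centre proportionally to squared distance
            idx = find(rand <= cumsum(p), 1, 'first');
            C(j, :) = X(idx, :);
            D = min(D, sum((X - C(j, :)).^2, 2));     % refresh nearest-centre distances
        end
    end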

Maximin:

The stochastic version is based on the study of Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 293-306.

The deterministic version is based on the study of Katsavounidis, I., Kuo, C. C. J., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1(10), 144-146.
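
A minimal sketch covering both versions (an illustration, not the repository's exact code): the stochastic variant starts from a random point, the deterministic variant from the point with the largest norm, and every later centre is the point farthest from its nearest chosen centre.

    % Hedged sketch of maximin seeding; X is n-by-d, k the number of clusters.
    function C = maximin_seed(X, k, deterministic)
        n = size(X, 1);
        C = zeros(k, size(X, 2));
        if deterministic
            [~, first] = max(sum(X.^2, 2));   % deterministic: point with the largest norm
        else
            first = randi(n);                 % stochastic: uniformly random point
        end
        C(1, :) = X(first, :);
        D = sum((X - C(1, :)).^2, 2);         % squared distance to the nearest chosen centre
        for j = 2:k
            [~, idx] = max(D);                % next centre: farthest point from current centres
            C(j, :) = X(idx, :);
            D = min(D, sum((X - C(j, :)).^2, 2));
        end
    end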

ROBIN:

The original algorithm (deterministic) is described in the study of Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. J. (2009). Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters, 30(11), 994-1002.

MATLAB code was originally based on the R implementation of the algorithm (stochastic version); package: wrsk

Density K-Means++:

Nidheesh, N., KA Abdul Nazeer, and P. M. Ameer. "An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data." Computers in biology and medicine 91 (2017): 213-221.

MATLAB code was based on the R implementation of the algorithm; code: dkmpp_0.1.0

Kaufman:

The MATLAB implementation was based on the pseudocode of Pena, J. M., Lozano, J. A., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10), 1027-1040.
