Evaluation of Fast k-means

Introduction

This repo holds the source code and scripts for reproducing the key experiments of our fast k-means evaluation. We also provide an example dataset that you can play with in the "dataset" folder.

Download our technical report here: https://github.com/tgbnhy/fast-kmeans/blob/master/unik-tr.pdf

Usage

  1. If you run in Eclipse, go to "edu.nyu.unik.expriments.kmeansEfficiency", open "Run Configurations", create a new Java application, and fill in the following program arguments:
```
./dataset/europediff_169300_2.txt 10 169300 a euro 0 1
```

The arguments are as follows:

arg[0] is the dataset file to cluster
arg[1] is the number of clusters (k)
arg[2] is the number of points in the data file to be clustered (|D|)
arg[4] is the dataset name, used to label the log files
arg[5] is the start dimension (column) in the dataset file
arg[6] is the end dimension (column) in the dataset file

Then, all the results will be recorded in the log files under the "logs" folder.
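
For orientation, here is a minimal sketch of how a main method could consume these positional arguments. The variable names are illustrative only and are not taken from the repo's code:

```java
// Illustrative only: how the positional arguments above might be read.
public class ArgsSketch {
    public static void main(String[] args) {
        String datasetFile = args[0];               // e.g. ./dataset/europediff_169300_2.txt
        int k = Integer.parseInt(args[1]);          // number of clusters, e.g. 10
        int numPoints = Integer.parseInt(args[2]);  // |D|, e.g. 169300
        // args[3] ("a" in the example) is not documented above.
        String datasetName = args[4];               // label used in log-file names, e.g. "euro"
        int startDim = Integer.parseInt(args[5]);   // first column to use, e.g. 0
        int endDim = Integer.parseInt(args[6]);     // last column to use, e.g. 1
        System.out.printf("clustering %d points from %s into %d clusters (cols %d..%d)%n",
                numPoints, datasetFile, k, startDim, endDim);
    }
}
```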

You can evaluate the index-based methods in a similar way, but using the class "edu.nyu.unik.expriments.kmeansEfficiencyIndex" instead; it runs on the same dataset and generates different log files.

  2. If you want to run from the command line (recommended):
```
mvn clean package
```

A file "torch-clus-0.0.1-SNAPSHOT.jar" will be generated under folder "target".

```
java -Xmx16192M -cp ./target/torch-clus-0.0.1-SNAPSHOT.jar edu.nyu.unik.expriments.kmeansEfficiency ./dataset/europediff_169300_2.txt 10 169300 a euro 0 1
```

Auto-tuning

You can open and run "Utune/utune.ipynb" to observe the precision obtained with various ML models.

Compared algorithms

We compared the following algorithms:

| Algorithm | Paper | Year |
|---|---|---|
| Lloyd | Least squares quantization in PCM | 1982 |
| Ball-tree | The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data | 2000 |
| kd-tree | An efficient k-means clustering algorithm: Analysis and implementation | 2002 |
| Elkan | Using the triangle inequality to accelerate k-means | 2003 |
| Hamerly | Making k-means even faster | 2010 |
| Drake | Accelerated k-means with adaptive distance bounds | 2012 |
| Annulus | Faster k-means Clustering (Master's thesis) | 2013 |
| Search | Scalable K-Means by ranked retrieval | 2014 |
| Yinyang | Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup | 2015 |
| Heap | Accelerating Lloyd's Algorithm for k-Means Clustering | 2015 |
| Expo | Fast k-means with accurate bounds | 2016 |
| Drift | Geometric methods to accelerate k-means algorithms | 2016 |
| Vector | Speeding up k-means by approximating Euclidean distances via block vectors | 2016 |
| Regroup | Two Modifications of Yinyang K-means Algorithm | 2017 |
| Cover-tree | A Dual-Tree Algorithm for Fast k-means Clustering With Large k | 2017 |
| Pami20 | A Fast Adaptive k-means with No Bounds | 2020 |
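
Many of these methods (Elkan, Hamerly, Drake, Yinyang, and others) accelerate Lloyd's algorithm by maintaining per-point distance bounds and using the triangle inequality to skip distance computations. The following is a minimal, self-contained sketch of that pruning idea in the style of Hamerly's bound test; it is illustrative only, not the repo's implementation:

```java
import java.util.Random;

// Illustrative sketch of triangle-inequality pruning (Hamerly-style);
// not the repo's code.
public class BoundPruningSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double t = a[d] - b[d]; s += t * t; }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        int n = 1000, k = 10, dim = 2;
        double[][] points = new double[n][dim];
        double[][] centers = new double[k][dim];
        for (double[] p : points) for (int d = 0; d < dim; d++) p[d] = rnd.nextDouble();
        for (double[] c : centers) for (int d = 0; d < dim; d++) c[d] = rnd.nextDouble();

        int[] assign = new int[n];
        double[] upper = new double[n]; // upper bound on d(point, its own center)
        double[] lower = new double[n]; // lower bound on d(point, second-closest center)

        // Initial full pass: the exact assignment also initializes both bounds.
        for (int i = 0; i < n; i++) {
            double best = Double.MAX_VALUE, second = Double.MAX_VALUE;
            int bestC = -1;
            for (int c = 0; c < k; c++) {
                double d = dist(points[i], centers[c]);
                if (d < best) { second = best; best = d; bestC = c; }
                else if (d < second) { second = d; }
            }
            assign[i] = bestC;
            upper[i] = best;
            lower[i] = second;
        }

        // s[c]: half the distance from center c to its nearest other center.
        double[] s = new double[k];
        for (int c = 0; c < k; c++) {
            double m = Double.MAX_VALUE;
            for (int c2 = 0; c2 < k; c2++)
                if (c2 != c) m = Math.min(m, dist(centers[c], centers[c2]));
            s[c] = 0.5 * m;
        }

        // Pruned assignment pass: if upper[i] <= max(lower[i], s[assign[i]]),
        // the triangle inequality guarantees the assignment cannot change.
        int skipped = 0;
        for (int i = 0; i < n; i++) {
            double bound = Math.max(lower[i], s[assign[i]]);
            if (upper[i] <= bound) { skipped++; continue; }
            upper[i] = dist(points[i], centers[assign[i]]); // tighten, then retest
            if (upper[i] <= bound) { skipped++; continue; }
            // Otherwise fall back to a full scan over all k centers, as in Lloyd.
        }
        System.out.println("points pruned by bounds: " + skipped + " / " + n);
    }
}
```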

Datasets

Any dataset in CSV format can be clustered; you only need to specify the columns that contain numeric values. For example, you can download datasets from UCI (https://archive.ics.uci.edu/ml/datasets.php) and put them into the "dataset" folder. A sketch of such a column loader appears after the table below.

| Dataset | Link | Dimension |
|---|---|---|
| BigCross | https://s3.amazonaws.com/h2o-training/clustering/BigCross.data.gz | 57 |
| Conflong | http://networkrepository.com/ConfLongDemo-JSI.php | 3 |
| Covtype | https://archive.ics.uci.edu/ml/datasets/covertype | 55 |
| Europe | http://cs.joensuu.fi/sipu/datasets/europediff.txt | 2 |
| KeggDirect | https://archive.ics.uci.edu/ml/datasets/KEGG+Metabolic+Relation+Network+(Directed) | 24 |
| KeggUndirect | https://archive.ics.uci.edu/ml/datasets/KEGG+Metabolic+Reaction+Network+(Undirected) | 29 |
| NYC-Taxi | https://drive.google.com/open?id=1CH_FPB4deDVFm_zDvyya5YMbFBOOz7Bn | 2 |
| Skin | https://archive.ics.uci.edu/ml/datasets/skin+segmentation | 4 |
| Power | https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption | 9 |
| RoadNetwork | https://archive.ics.uci.edu/ml/datasets/3D+Road+Network+(North+Jutland,+Denmark) | 4 |
| US-Census | https://archive.ics.uci.edu/ml/machine-learning-databases/census1990-mld/ | 68 |
| Mnist | http://yann.lecun.com/exdb/mnist/ | 784 |
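
As mentioned above, only the columns from the start dimension to the end dimension are clustered. Here is a minimal sketch of such a loader; it is illustrative only (the repo has its own loading code), and the class and method names are assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative loader (not the repo's code): reads columns [startDim, endDim]
// of a comma- or whitespace-separated file into a double[][] array.
public class DatasetLoaderSketch {

    static double[][] load(String path, int startDim, int endDim) throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                String[] tok = line.trim().split("[,\\s]+");
                double[] row = new double[endDim - startDim + 1];
                for (int d = startDim; d <= endDim; d++) {
                    row[d - startDim] = Double.parseDouble(tok[d]);
                }
                rows.add(row);
            }
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) throws IOException {
        // Columns 0..1 of the example dataset, matching arg[5]=0 and arg[6]=1 above.
        double[][] data = load("./dataset/europediff_169300_2.txt", 0, 1);
        System.out.println("loaded " + data.length + " points of dimension " + data[0].length);
    }
}
```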

Results Interpretation

Log files

In the log files, you can check the multiple metrics we use; each file name indicates its configuration. For example, "logs/vldb_log1/euro_169300_2_10_BallMetric_30.log" contains multiple lines, and each line presents the corresponding performance of several sequential methods. The first line shows the overall running time; for the meaning of the other lines, check the function "writelogs" in the class "kmeansAlgorihtm". For the full names of all the methods we tested, check the functions "testExisting" and "testIndex".

When you evaluate the index-based methods, you will find newly generated log files: "euro_169300_2_30_index.log", which compares the construction time, and "euro_169300_2_10_30_clustering.log", which compares the clustering time.

Terminal

After you run the command, you will also see logs in the terminal. They mainly show the running time of each iteration of the various methods.

Citation

If you use our code for research work, please cite our paper as below:

@article{wang2019unik,
  title={On the Efficiency of {K}-Means Clustering: Evaluation, Optimization, and Algorithm Selection},
  author={Wang, Sheng and Sun, Yuan and Bao, Zhifeng},
  journal={Proceedings of the VLDB Endowment},
  volume={14},
  number={2},
  year={2020}
}