CURE

This repository implements Clustering via Uncoupled REgression (CURE), the clustering algorithm from Wang et al.'s paper Efficient Clustering for Stretched Mixtures: Landscape and Optimality (NeurIPS 2020). It also includes several experiments that benchmark CURE's performance.

Table of Contents

  1. Explanation
  2. Coding Overview
  3. Experiments

Explanation

Motivation

Many traditional clustering algorithms struggle with elliptically distributed data. K-Means, for example, assumes the data is spherically distributed and performs poorly when it is elliptical. CURE addresses this problem: it is a clustering algorithm designed specifically to handle elliptically distributed data well.

Loss Function

CURE seeks the weights that minimize the loss function

$$\hat{\beta} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{d+1}} \; \frac{1}{n} \sum_{i=1}^{n} f\big(\beta^\top x_i\big) + \lambda \big(\beta^\top \bar{x}\big)^2$$

where $\beta \in \mathbb{R}^{d+1}$ are the weights, $\hat{\beta}$ are the optimal weights that minimize the above equation, $x_i \in \mathbb{R}^{d+1}$ is a sample data point with $d$ features, $X \in \mathbb{R}^{n \times (d+1)}$ is the matrix of all $n$ datapoints, and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the average data point. To get an intercept term, we prepend a column of ones to the data, so each $x_i$ is really $(1, x_{i,1}, \dots, x_{i,d})$.

The discriminative function is defined as

$$f(x) = \begin{cases} \frac{1}{4}(x^2 - 1)^2 & \text{if } |x| \le a \\ \text{cubic spline} & \text{if } a < |x| < b \\ \text{linear} & \text{if } |x| \ge b \end{cases}$$

where $1 < a < b$ are hyperparameters and $g(x) = \frac{1}{4}(x^2 - 1)^2$ denotes the unclipped quartic.

We embed the data into 1D space via the dot product $\beta^\top x_i$. Then we plug $\beta^\top x_i$ into the discriminative function to separate, i.e. discriminate, the embedded data into two clusters. How does this work? $g(x)$ and $f(x)$ both have minima at $x = \pm 1$, so minimizing them maps many of our datapoints to $+1$ and many of them to $-1$, resulting in two different clusters. Because $g(x)$ becomes huge for large $|x|$ ($g$ is quartic), we construct $f(x)$, which is just like $g(x)$ except that when $|x|$ is too big we clip its growth with linear functions. More explicitly, $f(x)$ has three parts. When $|x|$ is small, $|x| \le a$, we minimize the quartic $\frac{1}{4}(x^2 - 1)^2$, which has two valleys at $\pm 1$. When $|x|$ is too big, $|x| \ge b$, we minimize a linear function so the loss doesn't blow up. When $|x|$ is somewhere in between, $a < |x| < b$, we use a cubic spline to connect the valleys to the linear pieces.

Figure: the quartic $g(x)$ and its clipped counterpart $f(x)$.
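To make the three pieces concrete, here is a minimal NumPy sketch of one way to build such a clipped function. The construction (a cubic that matches the quartic's value, slope, and curvature at $a$ and has zero curvature at $b$, so the linear tail joins with a matching slope) is one reasonable choice, not necessarily the paper's exact coefficients; the name `make_discriminative_fn` and the defaults are illustrative.

```python
import numpy as np

def make_discriminative_fn(a=2.0, b=3.0):
    """Build the clipped quartic f: quartic for |x| <= a, cubic spline for
    a < |x| < b, linear for |x| >= b. Hypothetical construction: the spline
    matches g's value, slope, and curvature at a, and has zero curvature at b."""
    g = lambda x: 0.25 * (x**2 - 1) ** 2     # quartic with valleys at +-1
    dg = lambda x: x**3 - x                  # g'(x)
    d2g = lambda x: 3 * x**2 - 1             # g''(x)

    # Cubic p(x) = c0 + c1*x + c2*x^2 + c3*x^3 on [a, b], determined by
    # p(a) = g(a), p'(a) = g'(a), p''(a) = g''(a), p''(b) = 0.
    A = np.array([
        [1.0, a, a**2, a**3],
        [0.0, 1.0, 2 * a, 3 * a**2],
        [0.0, 0.0, 2.0, 6 * a],
        [0.0, 0.0, 2.0, 6 * b],
    ])
    coeffs = np.linalg.solve(A, np.array([g(a), dg(a), d2g(a), 0.0]))
    p = np.polynomial.Polynomial(coeffs)     # coefficients in increasing degree
    dp = p.deriv()

    def f(x):
        z = np.abs(np.asarray(x, dtype=float))   # f is even, so work with |x|
        return np.where(z <= a, g(z),
                        np.where(z < b, p(z), p(b) + dp(b) * (z - b)))
    return f
```

Keeping $f$ continuously differentiable at both junctions matters because we later hand it to a gradient-based optimizer.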

So $\frac{1}{n} \sum_{i=1}^{n} f(\beta^\top x_i)$ embeds the data in 1D space and measures, on average, how well the data is separated into two clusters located at $\pm 1$. The lower this value, the better. However, we could minimize this term by collapsing all of the datapoints into a single cluster. To avoid this trivial solution, we add the penalty term $\lambda (\beta^\top \bar{x})^2$. This term encourages the data to be evenly split between the two clusters at $\pm 1$: it attains its lowest value when $\beta^\top \bar{x} = 0$, which only occurs when the embedded data is evenly split between both clusters.

Putting it all together, the loss function embeds the data in 1D space and measures, on average, how well the data is separated into two evenly sized clusters located at $x = \pm 1$. We minimize this score and record the weights $\hat{\beta}$ that achieve the minimum.
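As a sketch, the full loss is then only a few lines of NumPy once the clipped function is in hand (here `X` is assumed to already carry its column of ones, and `lam` is the penalty weight $\lambda$):

```python
import numpy as np

def cure_loss(beta, X, lam, f):
    """CURE loss: average clipped quartic of the 1-D embedding,
    plus the penalty that favors an even split between clusters."""
    z = X @ beta                                       # embed each datapoint in 1-D
    fit_term = f(z).mean()                             # low when embeddings sit near +-1
    balance_term = lam * (X.mean(axis=0) @ beta) ** 2  # low when beta^T x_bar ~ 0
    return fit_term + balance_term
```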

Clustering

Once we have computed $\hat{\beta}$, the clustering comes into play: $\hat{y}_i = \operatorname{sign}(\hat{\beta}^\top x_i)$, where $\hat{y}_i$ is the predicted label of the $i$-th sample, for all datapoints $i = 1, \dots, n$. The sign function puts all positive datapoints into one cluster and all negative datapoints into a different cluster. And these are our clusters! That's it.
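In code, the assignment rule is a one-liner (`beta_hat` being the fitted weights and `X` the intercept-augmented data):

```python
import numpy as np

def cure_assign(beta_hat, X):
    """Cluster by the sign of each datapoint's 1-D embedding:
    +1 is one cluster, -1 is the other."""
    return np.sign(X @ beta_hat)
```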

Coding Overview

I designed CURE to resemble many of the estimators in scikit-learn. I did this by creating a cure class with an __init__ and three methods.

__init__

When you initialize the class, you specify the random seed and the parameters a and b.

fit()

When you call fit, it computes the weights $\hat{\beta}$ that minimize the loss function. It relies on SciPy's minimization routine.

predict()

When you call predict, it uses the weights computed in fit to predict a label for each datapoint via $\hat{y}_i = \operatorname{sign}(\hat{\beta}^\top x_i)$.

fit_predict()

This calls both fit and predict on the training data.
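Putting those four pieces together, a minimal sketch of such a class might look like the following. The names, defaults, and the use of scipy.optimize.minimize with a random initial guess are my assumptions, not necessarily the repo's exact API; for brevity the loss below uses the plain quartic rather than the clipped $f$ from earlier.

```python
import numpy as np
from scipy.optimize import minimize

class CURE:
    """Illustrative scikit-learn style CURE estimator (not the repo's exact API)."""

    def __init__(self, a=2.0, b=3.0, lam=1.0, random_state=0):
        self.a, self.b, self.lam = a, b, lam
        self.random_state = random_state

    def _augment(self, X):
        # Prepend a column of ones so beta[0] acts as the intercept.
        return np.column_stack([np.ones(len(X)), X])

    def _loss(self, beta, X):
        # Plain quartic for brevity; swap in the clipped f from the sketch above.
        z = X @ beta
        return (0.25 * (z**2 - 1) ** 2).mean() + self.lam * (X.mean(axis=0) @ beta) ** 2

    def fit(self, X):
        Xb = self._augment(np.asarray(X, dtype=float))
        rng = np.random.default_rng(self.random_state)
        beta0 = rng.normal(size=Xb.shape[1])           # random initial weights
        self.beta_ = minimize(self._loss, beta0, args=(Xb,)).x
        return self

    def predict(self, X):
        Xb = self._augment(np.asarray(X, dtype=float))
        return np.sign(Xb @ self.beta_)                # y_i = sign(beta^T x_i)

    def fit_predict(self, X):
        return self.fit(X).predict(X)
```

Usage then mirrors scikit-learn: `labels = CURE(random_state=0).fit_predict(X)`.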

Experiments

Experiment 1

For our first experiment, let's use CURE to cluster some elliptically distributed data.

Figure (Experiment 1): the true clustering and CURE's predicted clustering, side by side.

These two plots suggest that CURE predicts which datapoint belongs to which cluster almost perfectly. This is a great visual check that everything is working.

Next, we look at the adjusted Rand index (ARI), a measure of the similarity between two data clusterings (here, the true labels $y_i$ and the predicted labels $\hat{y}_i$ over all datapoints) that is adjusted for the chance grouping of elements. It is the clustering analogue of accuracy, with a lower bound of -1 and an upper bound of 1; an ARI of 0 corresponds to the average clustering, i.e. a random guess. Our clustering achieved an ARI of 0.98, which is pretty great!

And finally, the misclassification rate is a tiny 0.4% (0.004). It seems we made a few incorrect predictions near the border between the two clusters. Nonetheless, CURE clearly does a great job of clustering this elliptically distributed data.
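Both numbers are straightforward to compute with scikit-learn; a quick sketch, assuming `y_true` and `y_pred` hold the ±1 labels from above (cluster labels are only defined up to a sign flip, so the error rate takes the better of the two matchings):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(y_true, y_pred)  # invariant to relabeling the clusters

# Misclassification rate: check both label matchings and keep the better one.
err = min(np.mean(y_true != y_pred), np.mean(y_true != -y_pred))
```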

Experiment 2

Experiment 2 compares the performance of CURE and many other clustering algorithms on various datasets.

Figure (Experiment 2): results for each algorithm on each dataset.

The clustering algorithms tested are:

  1. CURE
  2. KMeans
  3. Meanshift
  4. Spectral clustering
  5. Ward
  6. Agglomerative Clustering
  7. DBSCAN
  8. OPTICS
  9. BIRCH
  10. Gaussian Mixture

Then for every dataset and algorithm, I recorded the time it took for the algorithm to run and the Adjusted Rand Index (ARI).
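A sketch of what that benchmarking loop might look like (`datasets` is an assumed dict mapping names to `(X, y)` pairs, and only a few of the algorithms are shown):

```python
import time
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

algorithms = {
    "CURE": CURE(),                          # the class sketched earlier
    "KMeans": KMeans(n_clusters=2, n_init=10),
    "Spectral": SpectralClustering(n_clusters=2),
}

results = {}
for algo_name, algo in algorithms.items():
    for ds_name, (X, y) in datasets.items():  # `datasets` is assumed
        start = time.perf_counter()
        y_pred = algo.fit_predict(X)
        elapsed = time.perf_counter() - start
        results[(algo_name, ds_name)] = (elapsed, adjusted_rand_score(y, y_pred))
```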

A couple of observations:

  1. CURE had near-constant runtime, regardless of the distribution of the data.
  2. CURE achieved perfect classification on the two elliptically distributed datasets. In other words, it did exactly what it was supposed to do.
  3. CURE only performed well on clusters that were linearly separable, i.e. it performed quite poorly on the first two datasets. This suggests that CURE might only work well on data with a linear decision boundary.
  4. Generally speaking, CURE was on par with the rest of the well-known clustering algorithms. However, Spectral Clustering performed the best overall.

Animation of CURE

Figure: an animated GIF of CURE in action.
