lofcz/sharpkmeans

SharpKMeans

K-Means++ implementation for the .NET platform, includes Silhouette K-Estimator and Anderson-Darling statistical test.

Getting Started

Install-Package sharpkmeans

Create a dataset of N IEnumerable<float> items, where each item represents an embedding in M-dimensional space. For example, these could be ada-002 embeddings of answers to semi-open-ended questions. For large M, consider reducing the dimensionality first, for example with UMAP: SharpKMeans uses Euclidean distance as its distance function, and Euclidean distance quickly loses its ability to separate near points from far ones as the number of dimensions grows.
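To illustrate why Euclidean distance degrades in high dimensions, here is a minimal sketch that is independent of SharpKMeans. It draws uniformly random points and measures the relative contrast (farthest distance minus nearest distance, divided by the nearest distance) from one point to all the others. The class and method names are hypothetical and exist only for this illustration.

```csharp
using System;

public static class DistanceConcentration
{
    // Plain Euclidean distance between two vectors of equal length.
    public static double Euclidean(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Relative contrast (max - min) / min of the distances from the
    // first random point to every other random point.
    public static double RelativeContrast(int dimensions, int points, int seed)
    {
        var rng = new Random(seed);
        var data = new float[points][];
        for (int i = 0; i < points; i++)
        {
            data[i] = new float[dimensions];
            for (int d = 0; d < dimensions; d++)
                data[i][d] = (float)rng.NextDouble();
        }

        double min = double.MaxValue, max = 0;
        for (int i = 1; i < points; i++)
        {
            double dist = Euclidean(data[0], data[i]);
            min = Math.Min(min, dist);
            max = Math.Max(max, dist);
        }
        return (max - min) / min;
    }
}
```

Comparing `DistanceConcentration.RelativeContrast(2, 200, 42)` with `DistanceConcentration.RelativeContrast(1000, 200, 42)` should show a much smaller contrast in the 1000-dimensional case: nearest and farthest neighbours end up at almost the same distance, which is what makes cluster assignment by distance unreliable.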

SharpKMeans clusters the dataset when we expect it to contain distinct groups. K-Means works best on roughly spherical clusters; for irregularly shaped clusters, consider DBSCAN instead.

There are two routines available for this:

  • Evaluate(int clustersMin, int clustersMax, IEnumerable<IEnumerable<float>> data) if we don't know the exact K but know a range that contains it.
  • Evaluate(int clusters, IEnumerable<IEnumerable<float>> data) if we know the exact K.

Both routines are thread-safe and can take an optional argument with settings of type KMeansSettings. The settings available are:

  • Iterations - increase the value if suboptimal clusters are found.
  • RequiredDifferenceBetweenIterations - skips the slow convergence near the end by stopping the algorithm early once the centroids shift only by a very small amount between iterations.
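The interplay of the two settings above can be sketched with a bare-bones Lloyd iteration. This is not the SharpKMeans implementation: the class, the method, and the parameters `maxIterations` and `requiredDifference` are hypothetical stand-ins that only mirror the roles of Iterations and RequiredDifferenceBetweenIterations, and the exact stopping rule the library uses may differ.

```csharp
using System;
using System.Linq;

public static class ConvergenceSketch
{
    // Runs Lloyd iterations until the largest centroid shift falls below
    // requiredDifference or maxIterations is reached.
    public static (float[][] Centroids, int IterationsUsed) Run(
        float[][] data, float[][] initialCentroids, int maxIterations, double requiredDifference)
    {
        float[][] centroids = initialCentroids.Select(c => (float[])c.Clone()).ToArray();
        int dims = data[0].Length;

        for (int iter = 1; iter <= maxIterations; iter++)
        {
            var sums = new double[centroids.Length][];
            var counts = new int[centroids.Length];
            for (int k = 0; k < centroids.Length; k++) sums[k] = new double[dims];

            // Assignment step: each point goes to its nearest centroid.
            foreach (float[] point in data)
            {
                int best = 0;
                double bestDist = double.MaxValue;
                for (int k = 0; k < centroids.Length; k++)
                {
                    double dist = 0;
                    for (int d = 0; d < dims; d++)
                    {
                        double diff = point[d] - centroids[k][d];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = k; }
                }
                counts[best]++;
                for (int d = 0; d < dims; d++) sums[best][d] += point[d];
            }

            // Update step: move centroids to the mean and track the largest shift.
            double maxShift = 0;
            for (int k = 0; k < centroids.Length; k++)
            {
                if (counts[k] == 0) continue; // keep empty clusters where they are
                double shift = 0;
                for (int d = 0; d < dims; d++)
                {
                    float updated = (float)(sums[k][d] / counts[k]);
                    double diff = updated - centroids[k][d];
                    shift += diff * diff;
                    centroids[k][d] = updated;
                }
                maxShift = Math.Max(maxShift, Math.Sqrt(shift));
            }

            // Early exit: the centroids barely moved, so further passes add little.
            if (maxShift < requiredDifference) return (centroids, iter);
        }
        return (centroids, maxIterations);
    }
}
```

A larger tolerance trades a little accuracy for fewer passes over the data, while a larger iteration cap gives slow-converging datasets more room before the loop gives up.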

An example of usage:

float[][] data = {
    new [] { 0f, 0.2f, 6f },
    new [] { 2.0f, 4f, 1.2f }
    // Add more data here; each datapoint should have at least two dimensions.
    // For the Anderson-Darling check, each cluster needs at least 5 datapoints.
};

// Evaluate every K in the range 3..20.
KMeansResultSilhouette[] result = KMeans.Evaluate(3, 20, data);
KMeansResult bestResult = result[0].Result;
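The range overload returns KMeansResultSilhouette entries, which suggests that candidate values of K are compared by their silhouette score. As a rough, library-independent sketch of how a mean silhouette score can be computed for one clustering (the class and method names below are hypothetical, and SharpKMeans may compute or rank the score differently):

```csharp
using System;
using System.Linq;

public static class SilhouetteSketch
{
    static double Distance(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Mean silhouette over all points; labels[i] is the cluster of data[i].
    // Per point: s = (b - a) / max(a, b), where a is the mean distance to the
    // point's own cluster and b the mean distance to the nearest other cluster.
    public static double MeanSilhouette(float[][] data, int[] labels)
    {
        int[] clusterIds = labels.Distinct().ToArray();
        double total = 0;

        for (int i = 0; i < data.Length; i++)
        {
            double a = 0;
            double b = double.PositiveInfinity;
            bool singleton = false;

            foreach (int c in clusterIds)
            {
                double sum = 0;
                int count = 0;
                for (int j = 0; j < data.Length; j++)
                {
                    if (j == i || labels[j] != c) continue;
                    sum += Distance(data[i], data[j]);
                    count++;
                }

                if (c == labels[i])
                {
                    if (count == 0) singleton = true; // alone in its cluster
                    else a = sum / count;
                }
                else if (count > 0)
                {
                    b = Math.Min(b, sum / count);
                }
            }

            // Singletons (and the degenerate single-cluster case) contribute 0.
            double denom = Math.Max(a, b);
            if (!singleton && !double.IsPositiveInfinity(b) && denom > 0)
                total += (b - a) / denom;
        }
        return total / data.Length;
    }
}
```

Scores close to 1 indicate compact, well-separated clusters, and scores below 0 indicate points that sit closer to a neighbouring cluster than to their own. A K-estimator built on this idea would run the clustering for each candidate K and prefer the K with the highest mean score.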

The output structure contains:

  • Clusters - an array of clusters where each cluster is defined by its medoid
  • Datapoints - input datapoints assigned to the clusters
  • Convergence - the convergence progression report

An example of results plotted via ImageSharp; here K was inferred as 7:

[Image: plotted clustering result]

Acknowledgments
