KMeansClusterer

k-means clustering in Ruby. Uses NArray under the hood for fast calculations.

Jump to the examples directory to see this in action.

Features

Runs multiple clustering attempts and keeps the best solution (a single run can get stuck in a non-optimal local minimum)
Initializes centroids via the k-means++ algorithm, for faster convergence
Calculates a silhouette score for evaluating cluster quality
Option to scale data before clustering, so that output isn't biased by features measured on different scales
Works with high-dimensional data
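
To illustrate the k-means++ initialization mentioned above, here is a minimal sketch in plain Ruby. It is not the gem's implementation (which uses NArray); the names `squared_distance` and `kmpp_init` are hypothetical and exist only for this example. The idea is that each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which tends to spread the starting centroids out.

```ruby
# Squared Euclidean distance between two points (arrays of numbers).
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Illustrative k-means++ seeding (hypothetical helper, not the gem's API).
# Returns up to k points from `points` to use as initial centroids.
# `rng` is injectable so that a run can be reproduced.
def kmpp_init(points, k, rng: Random.new)
  # The first centroid is picked uniformly at random.
  centroids = [points[rng.rand(points.length)]]

  while centroids.length < k
    # Weight of each point: squared distance to its nearest chosen centroid.
    weights = points.map { |p| centroids.map { |c| squared_distance(p, c) }.min }
    total = weights.sum
    break if total.zero? # all remaining points coincide with a centroid

    # Sample the next centroid with probability proportional to its weight.
    target = rng.rand * total
    cumulative = 0.0
    chosen = points.zip(weights).find { |_, w| (cumulative += w) > target }

    # Fallback for floating-point rounding: take the farthest point.
    centroids << (chosen ? chosen.first : points[weights.index(weights.max)])
  end

  centroids
end

cities = [[40.71, -74.01], [34.05, -118.24], [39.29, -76.61],
          [45.52, -122.68], [38.9, -77.04], [36.11, -115.17]]
seeds = kmpp_init(cities, 2, rng: Random.new(42))
```

Points that are already centroids have weight zero and are never picked again, so the returned centroids are distinct as long as the input points are.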

Install

gem install kmeans-clusterer

Usage

Simple example:

require 'kmeans-clusterer'

data = [[40.71,-74.01],[34.05,-118.24],[39.29,-76.61],
        [45.52,-122.68],[38.9,-77.04],[36.11,-115.17]]

labels = ['New York', 'Los Angeles', 'Baltimore', 
          'Portland', 'Washington DC', 'Las Vegas']

k = 2 # find 2 clusters in data

kmeans = KMeansClusterer.run k, data, labels: labels, runs: 5

kmeans.clusters.each do |cluster|
  names = cluster.points.map(&:label).join(", ")
  puts "#{cluster.id}. #{names}\t#{cluster.centroid}"
end

# Use existing clusters for prediction with new data:
predicted = kmeans.predict [[41.85,-87.65]] # Chicago
puts "\nClosest cluster to Chicago: #{predicted[0]}"

# Clustering quality score, ranging from -1.0 to 1.0 (1.0 is best)
puts "\nSilhouette score: #{kmeans.silhouette.round(2)}"

Output of simple example:

0. New York, Baltimore, Washington DC [39.63, -75.89]
1. Los Angeles, Portland, Las Vegas [38.56, -118.7]

Closest cluster to Chicago: 0

Silhouette score: 0.91
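
For readers unfamiliar with the silhouette score reported above, the following plain-Ruby sketch shows one common way to compute it. This is an assumption-laden illustration, not the gem's code: `euclidean` and `silhouette_score` are hypothetical names, plain Euclidean distance is assumed, and the value it produces may differ from what `kmeans.silhouette` reports if the gem uses a different distance measure or convention. It expects at least two non-empty clusters.

```ruby
# Euclidean distance between two points (arrays of numbers).
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean silhouette coefficient (hypothetical helper, not the gem's API).
# `clusters` is an array of clusters; each cluster is an array of points.
def silhouette_score(clusters)
  scores = clusters.each_with_index.flat_map do |cluster, i|
    cluster.map do |point|
      next 0.0 if cluster.length < 2 # common convention for singleton clusters

      # a: mean distance to the other members of the point's own cluster
      a = cluster.sum { |other| euclidean(point, other) } / (cluster.length - 1)

      # b: smallest mean distance to the members of any other cluster
      b = clusters.each_with_index
                  .reject { |_, j| j == i }
                  .map { |other, _| other.sum { |q| euclidean(point, q) } / other.length }
                  .min

      (b - a) / [a, b].max
    end
  end

  scores.sum / scores.length
end

east = [[40.71, -74.01], [39.29, -76.61], [38.9, -77.04]]
west = [[34.05, -118.24], [45.52, -122.68], [36.11, -115.17]]
score = silhouette_score([east, west])
puts "Silhouette (sketch): #{score.round(2)}"
```

A point scores close to 1.0 when it is much nearer to its own cluster than to any other, close to 0 when it sits on a boundary, and negative when it would fit better in a neighboring cluster.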

Options

The following options can be passed in to KMeansClusterer.run:

option	default	description
:labels	nil	optional array of Ruby objects to collate with data array
:runs	10	number of times to run kmeans
:log	false	print stats after each run
:init	:kmpp	algorithm for picking initial cluster centroids. Accepts :kmpp, :random, or an array of k centroids
:scale_data	false	scales features before clustering using formula (data - mean) / std
:float_precision	:double	float precision to use. :double or :single
:max_iter	300	max iterations per run
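
The `(data - mean) / std` formula behind `:scale_data` can be sketched in plain Ruby as follows. This is only an illustration: `standardize` is a hypothetical helper, and it assumes the population standard deviation (dividing by n); the gem may compute `std` differently.

```ruby
# Standardize each feature (column) to zero mean and unit standard deviation.
# Hypothetical helper for illustration; assumes population std (divide by n).
def standardize(rows)
  columns = rows.transpose.map do |col|
    mean = col.sum / col.length.to_f
    std = Math.sqrt(col.sum { |v| (v - mean)**2 } / col.length)
    std = 1.0 if std.zero? # constant feature: avoid dividing by zero
    col.map { |v| (v - mean) / std }
  end
  columns.transpose
end

raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(raw)
```

After scaling, both columns of `raw` contain the same values even though the second feature was originally 100 times larger, so neither dominates the distance calculation.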

gbuesing/kmeans-clusterer

Folders and files

Latest commit

History

Repository files navigation

KMeansClusterer

Features

Install

Usage

Options

About

Topics

Resources

License

Stars

Watchers

Forks

Languages