Skip to content

carsonfarmer/fastpair

Repository files navigation

FastPair

Data-structure for the dynamic closest-pair problem.

Overview

This project is an implementation of the FastPair dynamic closest-pair data-structure described in David Eppstein's Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs. The data-structure is based on the observation that the conga line data- structure, in practice, does better the more subsets you give to it: even though the worst case time for $k$ subsets is $O(nk\log{(n/k)})$, that worst case seems much harder to reach than the nearest neighbor algorithm.

In the limit of arbitrarily many subsets, each new addition or point moved by a deletion will be in a singleton subset, and the algorithm will differ from nearest neighbors in only a couple of ways: (1) when we create the initial data structure, we use a conga line rather than all nearest neighbors, to keep the in-degree of each point low, and (2) when we insert a point, we don't bother updating other points' neighbors.

Total space: $20n$ bytes (could be reduced to 4n at some cost in update time). Time per insertion or single distance update: $O(n)$. Time per deletion or point update: $O(n)$ expected, $O(n^2)$ worst case. Time per closest pair: $O(n)$.

This Python version of the algorithm combines ideas and code from the closest-pair data structure testbed (C++) developed around a series of papers by Eppstein et al.

Installation

FastPairs has not yet been uploaded to PyPi, as we are currently at the 'pre-release' stage*. Having said that you should be able to install it via pip directly from the GitHub repository with:

pip install git+git://github.com/carsonfarmer/fastpair.git

You can also install FastPair by cloning the GitHub repository and using the setup script:

git clone https://github.com/carsonfarmer/fastpair.git
cd fastpair
python setup.py install

* This means the API is not set, and subject to crazy changes at any time!

Testing

Build StatusCoverage Status

FastPair comes with a comprehensive preliminary range of tests. To run the tests, you can use py.test (maybe also nosetests?), which can be installed via pip using the recommended.txt file (note, this will also install numpy, scipy, matplotlib, and IPython which are all great and useful for tests and examples). So far testing has been done with CPython 2.7 and 3.4.

pip install -r recommended.txt
py.test fastpair

Features

In the following examples we use the random module to generate data.

from fastpair import FastPair, interact
import random

def rand_tuple(dim=2):
    return tuple([random.random() for _ in range(dim)])

Basics

The simplest way to use a FastPair data-structure is to initialize one and then update it with data points (via the += operator). In this first example, we create a sequence of $50 \times 10$ uniform random points and add them to a FastPair object:

points = [rand_tuple(10) for _ in range(50)]
# Create empty data-structure with `min_points=10` and
# using a Euclidean distance metric
fp = FastPair()
fp.build(points)  # Add points all at once and build conga line to start

You can then add additional points, and start to query the data-structure for the closest pair of points. As points are added, the data-structure responds and updates accordingly (see this paper for details):

fp += rand_tuple(10)
fp += rand_tuple(10)

# This is the 'FastPair' algorithm, should be fast for large n
fp.closest_pair()
# There is also a brute-force version, can be fast for smaller n
fp.closest_pair_brute_force()

FastPair has several useful properties and methods, including checking the size of the data-structure (i.e., how many points are currently stored), testing for containment of a given point, various methods for computing the closest pair, finding the neighbor of a given point, computing multiple distances at once, and even merging points (clusters):

len(fp)
rando = rand_tuple(10)
points[0] in fp  # True
rando in fp  # False
fp()  # Compute closest pair
neigh = fp.find_neighbor(rando)  # Neighbor of 'outside' point
fp.sdist(rando)  # Compute distances from rando to all points in fp

To illustrate the mergeing methods, here is a simple example of hierarchical clustering, treating points as the 'centroids' of various clusters:

for i in range(len(fp)-1):
    # First method... do it manually:
    dist, (a, b) = fp.closest_pair()
    c = interact(a, b)  # Compute mean centroid
    fp -= b
    fp -= a
    fp += c
    # Alternatively... do it all in one step:
    # fp.merge_closest()
len(fp)  # 1

Finally, plotting should be pretty obvious to those familiar with matplotlib (or other Python plotting facilities).

import matplotlib.pyplot as plt

points = [rand_tuple(2) for _ in range(50)]  # 2D points
fp = FastPair().build(points)
dist, (a, b) = fp.closest_pair()

plt.figure()
plt.scatter(*zip(*fp.points))
plt.scatter(*zip(a, b), color="red")
plt.title("Closest pair is {:.2} units apart.".format(dist))
plt.show()

License

Copyright © 2016, Carson J. Q. Farmer
Copyright © 2002-2015, David Eppstein
Licensed under the MIT License.

About

FastPair: Data-structure for the dynamic closest-pair problem.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages