Data-structure for the dynamic closest-pair problem.
This project is an implementation of the FastPair dynamic closest-pair
data-structure described in David Eppstein's Fast Hierarchical Clustering
and Other Applications of Dynamic Closest Pairs.
The data-structure is based on the observation that the conga line data-
structure, in
practice, does better the more subsets you give to it: even
though the worst case time for
In the limit of arbitrarily many subsets, each new addition or point moved by a deletion will be in a singleton subset, and the algorithm will differ from nearest neighbors in only a couple of ways: (1) when we create the initial data structure, we use a conga line rather than all nearest neighbors, to keep the in-degree of each point low, and (2) when we insert a point, we don't bother updating other points' neighbors.
Total space:
This Python
version of the algorithm combines ideas and code from the
closest-pair data structure testbed
(C++)
developed around a
series of papers
by Eppstein et al.
FastPairs
has not yet been uploaded to PyPi,
as we are currently at the 'pre-release' stage*. Having said that you should be
able to install it via pip
directly from the GitHub repository with:
pip install git+git://github.com/carsonfarmer/fastpair.git
You can also install FastPair
by cloning the
GitHub repository and using the
setup script:
git clone https://github.com/carsonfarmer/fastpair.git
cd fastpair
python setup.py install
* This means the API is not set, and subject to crazy changes at any time!
FastPair
comes with a comprehensive preliminary range
of tests. To run the tests, you can use py.test
(maybe also nosetests
?), which can be installed via pip
using the
recommended.txt
file (note, this will also install numpy
, scipy
,
matplotlib
, and IPython
which are all great and useful for
tests and examples). So far testing has been done with CPython
2.7 and 3.4.
pip install -r recommended.txt
py.test fastpair
In the following examples we use the random
module to generate data.
from fastpair import FastPair, interact
import random
def rand_tuple(dim=2):
return tuple([random.random() for _ in range(dim)])
The simplest way to use a FastPair
data-structure is to initialize one
and then update it with data points (via the +=
operator). In this first example, we create a sequence of FastPair
object:
points = [rand_tuple(10) for _ in range(50)]
# Create empty data-structure with `min_points=10` and
# using a Euclidean distance metric
fp = FastPair()
fp.build(points) # Add points all at once and build conga line to start
You can then add additional points, and start to query the data-structure for the closest pair of points. As points are added, the data-structure responds and updates accordingly (see this paper for details):
fp += rand_tuple(10)
fp += rand_tuple(10)
# This is the 'FastPair' algorithm, should be fast for large n
fp.closest_pair()
# There is also a brute-force version, can be fast for smaller n
fp.closest_pair_brute_force()
FastPair
has several useful properties and methods, including checking the
size of the data-structure (i.e., how many points are currently stored),
testing for containment of a given point, various methods for computing the
closest pair, finding the neighbor of a given point, computing multiple
distances at once, and even merging points (clusters):
len(fp)
rando = rand_tuple(10)
points[0] in fp # True
rando in fp # False
fp() # Compute closest pair
neigh = fp.find_neighbor(rando) # Neighbor of 'outside' point
fp.sdist(rando) # Compute distances from rando to all points in fp
To illustrate the merge
ing methods, here is a simple example of hierarchical
clustering, treating points
as the 'centroids' of various clusters:
for i in range(len(fp)-1):
# First method... do it manually:
dist, (a, b) = fp.closest_pair()
c = interact(a, b) # Compute mean centroid
fp -= b
fp -= a
fp += c
# Alternatively... do it all in one step:
# fp.merge_closest()
len(fp) # 1
Finally, plotting should be pretty obvious to those familiar with matplotlib
(or other Python
plotting facilities).
import matplotlib.pyplot as plt
points = [rand_tuple(2) for _ in range(50)] # 2D points
fp = FastPair().build(points)
dist, (a, b) = fp.closest_pair()
plt.figure()
plt.scatter(*zip(*fp.points))
plt.scatter(*zip(a, b), color="red")
plt.title("Closest pair is {:.2} units apart.".format(dist))
plt.show()
Copyright © 2016, Carson J. Q. Farmer
Copyright © 2002-2015, David Eppstein
Licensed under the MIT License.