
Faiss building blocks: clustering, PCA, quantization


Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.

Clustering

Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D tensor x is done as follows:

import faiss

# x: numpy array of shape (n, d), dtype float32
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)

The resulting centroids are in kmeans.centroids.

The values of the objective function (total squared error in the case of k-means) along the iterations are stored in the variable kmeans.obj.
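
For instance, convergence can be checked by inspecting kmeans.obj. A minimal sketch, assuming x was clustered as above:

import numpy as np

obj = np.asarray(kmeans.obj)  # one objective value per iteration
print("objective: %.2f -> %.2f over %d iterations" % (obj[0], obj[-1], len(obj)))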

Assignment

To compute the mapping from a set of vectors x to the cluster centroids after kmeans has finished training, use:

D, I = kmeans.index.search(x, 1)

This returns the nearest centroid for each row vector in x in I. D contains the corresponding squared L2 distances.
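
These outputs can be used directly to compute clustering statistics. A small sketch, continuing the example above:

import numpy as np

mean_sq_error = D.ravel().mean()  # average squared distance to the nearest centroid
cluster_sizes = np.bincount(I.ravel(), minlength=ncentroids)  # points assigned to each cluster
print(mean_sq_error, cluster_sizes.min(), cluster_sizes.max())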

For the reverse operation, e.g. finding the 15 nearest points in x to the computed centroids, a new index must be used:

index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)

I, of size (ncentroids, 15), contains the nearest neighbors of each centroid.

Clustering on the GPU

Clustering on one or several GPUs requires adapting the indexing object slightly.
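
For example, the lower-level Clustering object can be trained with a GpuIndexFlatL2 as the assignment index. This is a minimal sketch, assuming a GPU-enabled Faiss build with device 0 available:

import faiss

ncentroids = 1024
d = x.shape[1]

clus = faiss.Clustering(d, ncentroids)
clus.niter = 20
clus.verbose = True

# use a GPU index instead of a CPU one for the assignment step
res = faiss.StandardGpuResources()
cfg = faiss.GpuIndexFlatConfig()
cfg.device = 0
index = faiss.GpuIndexFlatL2(res, d, cfg)

clus.train(x, index)
centroids = faiss.vector_float_to_array(clus.centroids).reshape(ncentroids, d)

Recent versions of the Python wrapper also accept a gpu=True flag in the faiss.Kmeans constructor, which sets this up automatically.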

Computing a PCA

Let's reduce 40D vectors to 10D.

import faiss
import numpy as np

# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))

Note that in C++, apply_py is replaced with apply (apply was a built-in function in Python 2, so the Python wrapper renames it).
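
The Python wrapper also exposes reverse_transform to map vectors back to the original space. A short sketch; the reconstruction is lossy, since 30 components were dropped:

mt_rec = mat.reverse_transform(tr)  # back to 40D
rel_err = ((mt - mt_rec) ** 2).sum() / (mt ** 2).sum()
print(rel_err)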

PQ encoding / decoding

The ProductQuantizer object can be used to encode or decode vectors to codes:

import faiss
import numpy as np

d = 32  # data dimension
cs = 4  # code size (bytes)

# training set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)  # 8 bits per sub-vector index
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
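
The codes form a compact uint8 matrix, one cs-byte row per input vector. A quick check, continuing the example above:

# each 32-dim float32 vector (128 bytes) is compressed to 4 bytes
assert codes.shape == (n, cs)
assert codes.dtype == np.uint8
print("compression: %d bytes -> %d bytes per vector" % (d * 4, cs))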

A scalar quantizer works similarly:

import faiss
import numpy as np

d = 32  # data dimension

# training set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
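
As a quick experiment, continuing the example above, the reconstruction error of the two quantizer types mentioned can be compared; the 4-bit variant trades accuracy for a 2x smaller code:

# compare reconstruction error across scalar quantizer types
for name in ['QT_8bit', 'QT_4bit']:
    sq = faiss.ScalarQuantizer(d, getattr(faiss.ScalarQuantizer, name))
    sq.train(xt)
    x2 = sq.decode(sq.compute_codes(x))
    err = ((x - x2) ** 2).sum() / (x ** 2).sum()
    print(name, err)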