
Faiss building blocks: clustering, PCA, quantization


Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.

Clustering

Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D tensor x is done as follows:

import faiss

# x is an (n, d) matrix of float32 vectors to cluster
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)

The resulting centroids are in kmeans.centroids.

The values of the objective function (total squared error in the case of k-means) along the iterations are stored in the variable kmeans.obj, and more extensive statistics in kmeans.iteration_stats.
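For example, one can check that the objective decreases over the iterations:

# kmeans.obj contains one value of the k-means objective per iteration
print(kmeans.obj)       # should be a decreasing sequence
print(kmeans.obj[-1])   # final total squared error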

To run it on GPU, add the option gpu=True to the Kmeans constructor. This will use all available GPUs on the machine.

Additional options

The Kmeans object is mainly a thin layer on top of the C++ Clustering object, and all fields of that object can be set via the constructor (see the example after this list). The fields include:

  • nredo: run the clustering this number of times, and keep the best centroids (selected according to clustering objective)

  • verbose: make clustering more verbose

  • spherical: perform spherical k-means -- the centroids are L2 normalized after each iteration

  • int_centroids: round centroids coordinates to integer

  • update_index: whether to re-train the index used for assignment after each iteration

  • min_points_per_centroid / max_points_per_centroid: with fewer than min_points_per_centroid training points per centroid a warning is issued; with more than max_points_per_centroid, the training set is subsampled

  • seed: seed for the random number generator
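For example, the following sketch (the option values here are arbitrary) runs the clustering 5 times with spherical centroids and a fixed seed, keeping the best result:

kmeans = faiss.Kmeans(d, ncentroids, niter=20, nredo=5,
                      spherical=True, seed=1234, verbose=True)
kmeans.train(x)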

Assignment

To compute the mapping from a set of vectors x to the cluster centroids after kmeans has finished training, use:

D, I = kmeans.index.search(x, 1)

This will return the nearest centroid for each row vector of x in I. D contains the squared L2 distances.

For the reverse operation, e.g. to find the 15 nearest points in x to the computed centroids, a new index must be used:

index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)

The array I of size (ncentroids, 15) contains the nearest neighbors of each centroid.

Clustering on the GPU

Clustering on one or several GPUs is done via the gpu=True (use all GPUs) or gpu=3 (use 3 GPUs) constructor option of the Kmeans object.
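For example:

# run the clustering iterations on all available GPUs
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans.train(x)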

Computing a PCA

Let's reduce 40D vectors to 10D.

import faiss
import numpy as np

# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
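The transform can also be reversed. A minimal sketch, assuming the default PCAMatrix settings (orthonormal transform, no whitening or random rotation), where reverse_transform maps the reduced vectors back to the original space:

# map the 10-D vectors back to 40-D; this is lossy since the
# dropped components cannot be recovered
mt2 = mat.reverse_transform(tr)
print(((mt - mt2) ** 2).sum() / (mt ** 2).sum())  # relative reconstruction error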

Quantizers

The quantizer objects inherit from Quantizer, which offers three common methods (see impl/Quantizer.h):

  • train: train the quantizer on a matrix of vectors

  • compute_codes and decode: encoder and decoder. The encoder is usually lossy and returns a uint8 matrix of codes, one row per input vector.

  • get_DistanceComputer: returns a DistanceComputer object (see below).

The state of the Quantizer object is the result of the training (codebook tables or normalization coefficients). The field code_size indicates the number of bytes per code generated by the quantizer.
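For example, an 8-bit scalar quantizer stores one byte per vector component, so for 32-D vectors:

sq = faiss.ScalarQuantizer(32, faiss.ScalarQuantizer.QT_8bit)
print(sq.code_size)   # 32 bytes per code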

Types of quantizers

The supported quantizer types are:

  • ScalarQuantizer: quantizes each vector component separately in a linear range.

  • ProductQuantizer: performs vector quantization on sub-vectors

  • AdditiveQuantizer: encodes a vector as a sum of codebook entries, see Additive Quantizers for details. Additive quantizers can be trained in several ways, hence the sub-classes ResidualQuantizer, LocalSearchQuantizer, ProductAdditiveQuantizer.

Interestingly, each quantizer type generalizes the previous one: a scalar quantizer can be seen as a product quantizer with 1-dimensional sub-vectors, and a product quantizer as an additive quantizer whose codebooks have disjoint supports.

Each quantizer has a corresponding index type that also stores a set of quantized vectors.

| Quantizer class | Flat index class | IVF index class |
|---|---|---|
| ScalarQuantizer | IndexScalarQuantizer | IndexIVFScalarQuantizer |
| ProductQuantizer | IndexPQ | IndexIVFPQ |
| AdditiveQuantizer | IndexAdditiveQuantizer | IndexIVFAdditiveQuantizer |
| ResidualQuantizer | IndexResidualQuantizer | IndexIVFResidualQuantizer |
| LocalSearchQuantizer | IndexLocalSearchQuantizer | IndexIVFLocalSearchQuantizer |

In addition, all indexes except the ScalarQuantizer ones exist in a FastScan version; see Fast accumulation of PQ and AQ codes.
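For example (a minimal sketch with arbitrary sizes), the flat index corresponding to a ProductQuantizer trains the quantizer, stores the codes and searches them directly:

import faiss
import numpy as np

d = 32
xt = np.random.rand(10000, d).astype('float32')  # training set
xb = np.random.rand(20000, d).astype('float32')  # database to index

index = faiss.IndexPQ(d, 4, 8)   # 4 sub-quantizers of 8 bits each
index.train(xt)
index.add(xb)
D, I = index.search(xb[:5], 10)  # 10 nearest stored codes for 5 queries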

Distance computers

Some quantizers return a DistanceComputer object. Its purpose is to compute vector-to-code distances efficiently, when one vector is compared with many codes. In that case, it is often possible to precompute a set of tables to compute the distances directly in the compressed domain.

Therefore, the DistanceComputer offers:

  • a set_query method that sets the current uncompressed vector to compare with

  • a distance_to_code method that computes the actual distance with a given code.
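A minimal sketch of how this could look from Python, assuming pq is a trained ProductQuantizer (as in the example below) and that the bindings accept raw pointers via faiss.swig_ptr; check these assumptions against your Faiss version:

# q: one float32 query vector of dimension d
# codes: uint8 code matrix produced by pq.compute_codes
dc = pq.get_DistanceComputer()
dc.set_query(faiss.swig_ptr(q))                       # set the uncompressed query
dis = dc.distance_to_code(faiss.swig_ptr(codes[0]))   # distance to a single code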

Example: PQ encoding / decoding

The ProductQuantizer object can be used to encode or decode vectors to codes:

import faiss
import numpy as np

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)  # cs sub-quantizers of 8 bits each
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()

A scalar quantizer works similarly:

import faiss
import numpy as np

d = 32  # data dimension

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode 
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()