Fast accumulation of PQ and AQ codes (FastScan)

What is this?

Most product quantizer (PQ) decompositions use 8 bits per sub-vector, which is convenient because it is byte-aligned. At search time, the look-up tables are stored in RAM (hopefully in cache) and can be used for distance computations, see equation (13) in Product quantization for nearest neighbor search.
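As a concrete illustration, here is a minimal NumPy sketch of this look-up based distance computation (the function name and shapes are illustrative, not the Faiss API):

```python
import numpy as np

def adc_distances(query, codebooks, codes):
    """Sketch of look-up based PQ distances (8-bit PQ).

    query:     (d,) query vector
    codebooks: (M, 256, d/M) PQ codebooks
    codes:     (nb, M) uint8 codes of the database vectors
    """
    M, ksub, dsub = codebooks.shape
    qsub = query.reshape(M, dsub)
    # one look-up table per sub-quantizer: squared distances from the
    # query sub-vector to the 256 centroids of that sub-quantizer
    luts = ((codebooks - qsub[:, None, :]) ** 2).sum(axis=-1)  # (M, 256)
    # distance to each database vector = sum of M table look-ups
    return sum(luts[m, codes[:, m]] for m in range(M))         # (nb,)
```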

However, modern CPUs are more efficient at arithmetic operations than at memory look-ups. One way to avoid the look-ups is to degrade to less powerful vector encodings (like binary or scalar decompositions).

Another way is to store the search-time look-up tables in registers. This idea was introduced in "Cache locality is not enough: high-performance nearest neighbor search with product quantization fast scan", André et al, VLDB 15, hence the name "fast-scan" of the current implementation.

Implementation

The IndexPQFastScan and IndexIVFPQFastScan objects perform 4-bit PQ fast scan. The implementation is heavily inspired by Google's [SCANN](https://github.com/google-research/google-research/tree/master/scann).

They do not inherit directly from IndexPQ and IndexIVFPQ because the codes are "packed" in batches of bbs=32 (64 and 96 are supported as well but there are few operating points where they are competitive).
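As an illustration, a fast-scan index is used like any other Faiss index; a minimal sketch on random data (parameter values are arbitrary):

```python
import faiss
import numpy as np

d = 64
xb = np.random.rand(10000, d).astype('float32')
xq = np.random.rand(100, d).astype('float32')

# 4-bit fast-scan PQ with 32 sub-quantizers, i.e. PQ32x4fs
index = faiss.IndexPQFastScan(d, 32, 4)  # (d, M, nbits), default bbs=32
index.train(xb)
index.add(xb)
D, I = index.search(xq, 10)
```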

Fitting look-up tables in registers is possible only under the following conditions:

  • there must be many vector registers and there must be a look-up instruction (a.k.a. shuffle). This is supported on architectures like Intel AVX2 (16 256-bit registers) or ARM Neon (32 128-bit registers).

  • the LUT tables must be short: an 8-bit PQ produces 256 LUT entries per sub-quantizer, so we reduce it to 4 bits (16 entries).

  • the LUT table entries must be small: we cannot afford to store floating-point entries, so they are quantized to 8 bits per entry.

Re-ranking

Since the 4-bit PQ has a relatively low accuracy (PQ32x4 is significantly less accurate than PQ16x8 although they use the same amount of memory), it is useful to perform a re-ranking stage with exact distance computations.

This is supported via the IndexRefine object. It manages two indexes, the second one refining the results of the first. For example, if we need k=10 results, we query k * k_factor = 100 elements in the first index, compute exact (or more accurate) distances for these candidates, and return the top k.
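A minimal sketch of this setup, using IndexRefineFlat (the variant of IndexRefine that refines with exact distances); data and parameters are arbitrary:

```python
import faiss
import numpy as np

d = 64
xb = np.random.rand(10000, d).astype('float32')
xq = np.random.rand(100, d).astype('float32')

base = faiss.IndexPQFastScan(d, 32, 4)   # approximate first stage
index = faiss.IndexRefineFlat(base)      # re-rank with exact distances
index.train(xb)
index.add(xb)
index.k_factor = 10                      # fetch k * k_factor candidates
D, I = index.search(xq, 10)              # top 10 after re-ranking
```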

The drawbacks are that this requires storing a larger index, which matters in memory-constrained settings, and that there is one additional parameter (k_factor) to tune.

Factory strings

Faiss indexes can be constructed with the index_factory function that builds an index from a string.

The new PQ variants are supported via new factory strings:

  • PQ32x4fs means the "fast-scan" variant of PQ32x4. It can be prefixed with IVFxx to generate an IVF index. IVFxx,PQ32x4fsr is the IVF variant where the PQ encodes the residual vector relative to the centroid (more accurate but slightly slower).

  • refinement was already supported via the RFlat suffix. Refinement with an arbitrary index is now supported via Refine(SQ8), which means refine with an SQ8 index.

  • 4-bit PQ is also useful as a coarse quantizer. Therefore, the IVF factory string has been generalized: IVF1000(PQ32x4fs,RFlat) means using PQ32x4fs,RFlat as the coarse quantizer that assigns to 1000 centroids.

See a complex example here.
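A few of these factory strings in action (a sketch; the dimension and nlist values are arbitrary):

```python
import faiss

d = 64
index1 = faiss.index_factory(d, "PQ32x4fs")                      # flat 4-bit fast-scan PQ
index2 = faiss.index_factory(d, "IVF1024,PQ32x4fsr")             # IVF, PQ on residuals
index3 = faiss.index_factory(d, "IVF1024,PQ32x4fs,RFlat")        # + exact re-ranking
index4 = faiss.index_factory(d, "IVF1024,PQ32x4fs,Refine(SQ8)")  # refine with an SQ8 index
```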

Performance

Comparison with SCANN: The speed-accuracy tradeoff of the Faiss 4-bit PQ fast-scan implementation is compared with SCANN on four 1M-scale datasets. In short, Faiss is at least as fast and often faster.

Comparison with HNSW: without re-ranking, 4-bit PQ reaches up to 1M QPS. With re-ranking, it runs at 280k QPS at 1-recall@1 = 0.9, which is 2x faster than HNSW's 140k QPS. It also uses 2.7x less memory because no graph structure needs to be stored.

Best indexes for 10M-100M vectors: the experiments are broken down per code size. They show that 4-bit PQ fast-scan is useful as a coarse quantizer and for almost all operating points except the most accurate ones.

From about 64 bytes per vector, it can be combined with re-ranking. For example, to fit a 64-bytes-per-vector budget, 8 bytes can be allocated to the fast-scan PQ and 56 bytes to the refinement index (a larger PQ).
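A sketch of such a 64-byte budget as a factory string (the nlist value is arbitrary, and d must be divisible by both PQ sizes): PQ16x4 occupies 16 × 4 bits = 8 bytes, and the refinement PQ56 occupies 56 bytes:

```python
import faiss

d = 112  # divisible by both M=16 and M=56
# 8 bytes for the fast-scan PQ + 56 bytes for the refinement PQ
index = faiss.index_factory(d, "IVF1024,PQ16x4fsr,Refine(PQ56)")
```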

Additive quantization implementation

The additive quantizer (implemented by KinglittleQ) works similarly to PQ at search time: distances are computed by accumulating look-up table entries. The difference is that an additive quantizer approximates a vector as a sum of codebook entries rather than a concatenation, so the look-up tables contain inner products and, for L2 search, the (quantized) norm of each database vector is stored separately and added to the result.
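The contrast between the two decoders can be sketched in NumPy (shapes are illustrative, not the Faiss API):

```python
import numpy as np

def pq_decode(codes, codebooks):
    """PQ: concatenate one entry per sub-quantizer.
    codes: (nb, M) ints, codebooks: (M, ksub, dsub)."""
    return np.concatenate(
        [codebooks[m, codes[:, m]] for m in range(codes.shape[1])], axis=1
    )  # (nb, M * dsub)

def aq_decode(codes, codebooks):
    """Additive quantizer: sum one full-dimensional entry per codebook.
    codes: (nb, M) ints, codebooks: (M, ksub, d)."""
    return sum(codebooks[m, codes[:, m]] for m in range(codes.shape[1]))  # (nb, d)
```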

Implementation details

The computational kernel handles 32 vectors at a time.

The implementation is geared towards AVX, but ARM Neon is also supported (thanks to @vorj, @n-miyamoto-fixstars, @LWisteria, and @matsui528). It uses a small SIMD abstraction library, simdlib.h.

Code layout

For bbs = 32 and a PQ with M sub-quantizers, the codes are laid out in blocks of 32 bytes as follows:

address (bytes)    sub-quantizers
0                  0 and 1
32                 2 and 3
...                ...
(M/2 - 1) * 32     M-2 and M-1

Each 32-byte block represents the codes of 2 sub-quantizers (m and m+1) for 32 vectors. Within a block, the quantization indexes of the 32 vectors are interleaved to match the lane order of the SIMD shuffle instruction; the reference implementations below give the exact per-vector layout.
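A simplified NumPy sketch of this packing at block granularity; note that it deliberately omits the within-block permutation of the 32 vectors:

```python
import numpy as np

def pack_codes(codes, bbs=32):
    """Pack 4-bit codes (nb, M) into the block layout sketched above.

    Each 32-byte block holds sub-quantizers (m, m+1) for bbs=32 vectors:
    low nibble = code of sub-quantizer m, high nibble = code of m+1.
    NOTE: the real Faiss layout also permutes the 32 vectors inside a
    block to match the SIMD shuffle lane order; this sketch does not.
    """
    nb, M = codes.shape
    assert nb % bbs == 0 and M % 2 == 0
    nblock = nb // bbs
    packed = np.zeros((nblock, M // 2, bbs), dtype=np.uint8)
    for b in range(nblock):
        chunk = codes[b * bbs:(b + 1) * bbs]              # (bbs, M)
        for m in range(0, M, 2):
            # this 32-byte block sits at byte offset (m/2) * 32 in the batch
            packed[b, m // 2] = chunk[:, m] | (chunk[:, m + 1] << 4)
    return packed.reshape(nblock, M // 2 * bbs)
```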

A didactic Python implementation: simulate_kernels_PQ4.ipynb (the relevant section is loop3).

The C++ implementation is in pq4_fast_scan.h.

Look-up table quantization

The basic look-up accumulation does:

  • computations in uint16: the 8-bit look-up results are accumulated into 16-bit integers;

  • subtract biases: the minimum of each look-up table is subtracted before quantization and added back to the final result;

  • scale so that the max value fits in the result: a global scale is applied so that the accumulated sums do not overflow the 16-bit accumulators.
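A NumPy sketch of such a quantization (not the exact Faiss routine):

```python
import numpy as np

def quantize_luts(luts):
    """Quantize float look-up tables to uint8.

    luts: (M, 16) float32 tables for one query.
    Returns uint8 tables and the (bias, scale) needed to map the
    accumulated uint16 sums back to approximate float distances.
    """
    bias = luts.min(axis=1, keepdims=True)       # per-table bias
    shifted = luts - bias
    # one global scale so that each entry fits in uint8 (and, for
    # moderate M, the sum of M entries fits in the uint16 accumulators)
    scale = max(shifted.max() / 255.0, 1e-10)
    luts_u8 = np.round(shifted / scale).astype(np.uint8)
    return luts_u8, float(bias.sum()), scale

# approximate distance recovered from an accumulated uint16 sum `acc`:
#   dist ~ acc * scale + total_bias
```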

Look-up table layout

The layout of the look-up tables in registers is:

  • constrained by the look-up (shuffle) function;

  • the low 8 bits and the high 8 bits are kept separated.

Handling multiple queries

To increase arithmetic intensity, we handle multiple queries at a time, so that each block of codes read from memory is reused for several accumulations.
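A scalar NumPy sketch of the idea: the codes are loaded once and reused for nq queries:

```python
import numpy as np

def accumulate_multi_query(codes, luts):
    """Scalar sketch of multi-query accumulation.

    codes: (nb, M) uint8 4-bit codes (shown unpacked for clarity)
    luts:  (nq, M, 16) uint8 look-up tables, one row of tables per query

    The codes are read once and reused for all nq queries, so the
    arithmetic per byte loaded from memory grows with nq.
    """
    nb, M = codes.shape
    nq = luts.shape[0]
    acc = np.zeros((nq, nb), dtype=np.uint16)
    for m in range(M):
        c = codes[:, m]                  # quantization indexes for sub-quantizer m
        for q in range(nq):
            acc[q] += luts[q, m, c]      # one look-up per (query, vector)
    return acc
```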
