Indexing 1G vectors

For these dataset sizes, compression becomes mandatory (we are talking here about 10M-1G vectors per server). The main compression method used in Faiss is PQ (product quantizer) compression, with a pre-selection based on a coarse quantizer (see the previous section). When larger codes can be used, a scalar quantizer is more efficient. All methods are reported with their index_factory string.
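
For concreteness, here is a minimal sketch of building such an index via index_factory (the dimension, factory string, and random data below are illustrative placeholders, not the benchmark settings):

```python
import faiss
import numpy as np

d = 96                                               # e.g. the Deep* dimensionality
xt = np.random.rand(100_000, d).astype('float32')    # placeholder training set
xb = np.random.rand(1_000_000, d).astype('float32')  # placeholder database

# IVF coarse quantizer with 4096 centroids + 32-byte PQ codes
index = faiss.index_factory(d, "IVF4096,PQ32")
index.train(xt)                  # trains the coarse quantizer and the PQ codebooks
index.add(xb)
D, I = index.search(xb[:5], 10)  # sanity check: query with the first 5 vectors
```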

In the tests below we use the Deep1B (96-dim activations from a neural net) and Bigann (128-dim SIFT descriptors) datasets. Both datasets have 1B database vectors and 10k queries. For smaller databases we use the first N database vectors. We report the 1-recall@1 measure, which is the most sensitive one. For illustration, we also report the 1-recall@100 for a single setting: Deep100M with 64-byte codes.

We tried IMI and IVFx_HNSW variants for the coarse quantizer. It turns out that the IVFx_HNSW variant almost always offers the best speed-precision tradeoff. Larger numbers of centroids give a slightly higher asymptotic accuracy, but the index is also slower to build.
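
The corresponding factory strings look as follows (a sketch; the sizes are examples, not the tuned benchmark values):

```python
import faiss

d = 96
# IMI coarse quantizer: 2 sub-vocabularies of 2^12 centroids each (16M cells)
imi = faiss.index_factory(d, "IMI2x12,PQ32")
# IVF with 65536 centroids that are themselves searched via an HNSW graph;
# this is the variant that usually wins the speed-precision tradeoff
ivf_hnsw = faiss.index_factory(d, "IVF65536_HNSW32,PQ32")
```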

The plots show the speed vs. precision tradeoff for a fixed code size. The runtime was measured in batch mode with 16 threads, on machines that have 48 cores. This simulates an environment with a diverse set of tasks and some memory stress (in 1-thread mode the computational load is unrealistically large compared to the memory access time). The plots report only the combinations that are optimal for at least one operating point; the others are shown in light gray.

All experiments were performed with the code in bench_all_ivf.

10M datasets

The search-time parameters were auto-tuned. The main parameter is nprobe. Parameters that can be adjusted but are less important are efSearch=64 (for the IVFx_HNSWy variants) and max_codes=0 (for IMI) -- the values indicated here are reasonable defaults.
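
These parameters can be set at runtime through ParameterSpace, for example (a sketch that assumes `index` is a trained IVFx_HNSWy index from the snippets above):

```python
import faiss

# `index` is assumed to be a trained IVFx_HNSWy index
ps = faiss.ParameterSpace()
ps.set_index_parameter(index, "nprobe", 64)              # the main speed/precision knob
ps.set_index_parameter(index, "quantizer_efSearch", 64)  # HNSW search depth inside the coarse quantizer
```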

In terms of codes:

  • the best top-line precision is always obtained with PQ encoding
  • for faster / less accurate operating points, the scalar quantizer (SQ4 and SQ8) variants are a good choice because they are cheap to train and to search
  • the gap narrows for longer codes (32 and 64 bytes per code)
  • for recalls at ranks > 1 the difference is less important; it mainly matters for the 1-recall@1 performance (the one reported here)

The memory (RAM) usage consists of (see the sketch after this list for a rough estimate):

  • the codes
  • the ids corresponding to the codes (8 bytes per item)
  • the quantizer's size, the inverted list pointers + lengths (at least 16 bytes per centroid), and the precomputed tables for PQ -- note that the new default in Faiss is to not compute the tables if they would exceed 2GB in size (an arbitrary threshold that may be too big for small indexes and could be higher for bigger ones). These overheads are reported in the captions as + a certain %.
  • invisible overheads: the vectors' geometric reallocation buffers, etc. These are not reported because they are difficult to measure.
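
A back-of-the-envelope estimate of the first three components (a sketch only; it ignores the HNSW graph, the precomputed PQ tables and the invisible overheads):

```python
def ivfpq_ram_bytes(n_vectors, code_size, n_centroids, d):
    """Rough RAM estimate for an IVF-PQ index, in bytes."""
    codes = n_vectors * code_size     # the PQ codes themselves
    ids = n_vectors * 8               # 8-byte id per stored vector
    quantizer = n_centroids * d * 4   # flat float32 centroid table
    lists = n_centroids * 16          # inverted list pointers + lengths
    return codes + ids + quantizer + lists

# e.g. Deep10M with 16-byte codes and 65536 centroids:
print(ivfpq_ram_bytes(10**7, 16, 65536, 96) / 2**30, "GiB")  # ~0.25 GiB
```
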
[Plots: Deep10M and bigann10M, speed vs. precision at 8, 16, 32 and 64 bytes per code]

100M datasets

As coarse quantizers, we tried IMI2x11 (4M centroids), IMI2x12 (16M centroids), IVF65536_HNSW32 and IVF262144_HNSW32.

[Plots: Deep100M and bigann100M, speed vs. precision at 8, 16, 32 and 64 bytes per code]

1B datasets

The datasets are 10x bigger. We also indicate the index building time (with 24 threads, excluding the training time).

Observations:

  • building the PQ indexes is the slowest, but even for long codes, the difference with the SQ building time is less than 2x. This is because the SQ encoding is not very well optimized (as opposed to the SQ search).

  • building with more centroids is slower. For example, indexing Deep1B with 32-byte codes takes 3h46 with 200k centroids vs. 6h33 with 4M.

Note that for 1M and 4M centroids, we trained the vocabulary on GPU before building the index; otherwise the k-means is very slow. All other operations are on CPU.
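
The GPU k-means pass can be reproduced along these lines (a sketch; the training-set size, niter and PQ settings are placeholders, and it requires the faiss-gpu package):

```python
import faiss
import numpy as np

d = 96
ncentroids = 2**20                                   # ~1M centroids
xt = np.random.rand(5_000_000, d).astype('float32')  # placeholder training set

# run k-means on GPU; the centroids come back as a (ncentroids, d) numpy array
km = faiss.Kmeans(d, ncentroids, niter=10, gpu=True)
km.train(xt)

# feed the pre-trained centroids to a CPU quantizer; IndexIVF's train() then
# skips the clustering step because the quantizer already holds nlist vectors
quantizer = faiss.IndexFlatL2(d)
quantizer.add(km.centroids)
index = faiss.IndexIVFPQ(quantizer, d, ncentroids, 32, 8)  # 32-byte PQ codes
index.train(xt)   # only trains the PQ codebooks at this point
```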

[Plots: Deep1B and bigann1B, speed vs. precision at 8, 16, 32 and 64 bytes per code]

Older results

These older results also include some GPU comparisons.

Bigann dataset

The Bigann dataset is a classical benchmark used in computer vision. It contains 1 billion SIFT descriptors and is relatively easy to index.

The results were obtained with bench_polysemous_1bn.py:

python bench_polysemous_1bn.py SIFT1000M OPQ8_64,IMI2x14,PQ8 autotuneMT

Deep1B dataset

Another research dataset, introduced in the CVPR'16 paper by Babenko and Lempitsky ("Efficient Indexing of Billion-Scale Datasets of Deep Descriptors"). It contains 1 billion 96-dimensional descriptors.

Comparison with the state-of-the-art (Bigann)

The CVPR'16 paper by Wieschollek et al. describes a GPU implementation of the search. We experiment with relatively low-precision operating points (8 bytes per code) to allow a direct comparison with the published results. Note however that for high-quality neighbors, more bytes would be required (see above).

| Method | Hardware | R@10 | query time (ms) / vector |
|---|---|---|---|
| Wieschollek et al., CVPR'16 | Titan X | 0.35 | 0.15 |
| OPQ8_64,IMI2x13,PQ8x8 | CPU (1 thread) | 0.349 | 0.4852 |
| " | CPU (20 threads) | 0.349 | 0.035 |
| OPQ8_32,IVF262144,PQ8 | Titan X | 0.376 | 0.0340 |
| " | " | 0.448 | 0.1214 |

(methods are described with their index_factory string)

Comparison (Deep1B)

The operating point we are interested in is one that takes ~25 GB of RAM, which corresponds to 20-byte PQ codes. The first row is the best operating point we are aware of at the time we made the comparison. The other rows correspond to different operating points achieved by CPU- and GPU-Faiss algorithms.

| Method | Hardware | R@1 | query time (ms) / vector |
|---|---|---|---|
| Babenko & al., CVPR'16 | CPU (1 thread) | 0.45 | 20 |
| OPQ20_80,IMI2x14,PQ20 | CPU (1 thread) | 0.4561 | 3.66 |
| OPQ20_80,IVF262144,PQ20 | 4*K40 | 0.488 | 0.18 |
| " | 4*K40 | 0.493 | 1.1 |
| OPQ32,IVF262144,PQ32 | 8*TitanX | 0.671 | 0.2328 |
| OPQ64_128,IVF262144,PQ64 (float16 mode) | 8*TitanX | 0.856 | 0.3207 |