How to make Faiss run faster

This page collects the options and tricks that make Faiss run faster, including speed-vs-accuracy tradeoffs for cases where speed matters more than accuracy.

The topics are provided in no particular order of importance.

Speed vs accuracy: pick the index parameters

It is very common to instantiate an index via a faiss.index_factory() call. But this way of instantiating an index does not let you set many of the available speed-related parameters. Many Faiss components may use:

  • iterative algorithms; the more iterations performed, the better the recall rate.
  • heuristics that limit the number of candidates considered during training or search, such as beam search. The more candidates evaluated, the better the recall rate.
  • additional optional execution paths, such as LUT-based approximations; they trade some accuracy for speed.
  • various internal buffers for processing input data in chunks, mentioned later on this page.
  • random number generators; these are used to ensure that results can be reproduced.

By default, the index parameters should work reasonably well in the most general cases. Still, it is highly recommended to study the available parameters in order to find the optimal point for your speed-vs-accuracy-vs-RAM tradeoff.

The recipe: study the available parameters of a chosen index and its subindices.
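
For example, here is a minimal Python sketch of setting speed-related parameters after an index_factory() call; the dimensionality, the factory string and the values are arbitrary choices for illustration:

import faiss

d = 64  # arbitrary dimensionality, just for the sketch
index = faiss.index_factory(d, "IVF1024,PQ32")

# ... index.train(xt) and index.add(xb) would go here ...

# The factory string fixes the index structure, but speed-related knobs
# live on the (sub)index objects and can be set afterwards.
ivf = faiss.extract_index_ivf(index)
ivf.nprobe = 16  # more probed lists: better recall, slower search

# ParameterSpace offers a string-based way to set the same knobs.
ps = faiss.ParameterSpace()
ps.set_index_parameter(index, "nprobe", 16)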

Speed vs accuracy: k-means clustering

K-means clustering is a frequently used facility inside Faiss. By default, the k-means implementation in faiss/Clustering.h uses 25 iterations (niter) and up to 256 samples from the input dataset per cluster (max_points_per_centroid). The samples are chosen randomly. For example, default PQx12 training is ~4x slower than PQx10 training and ~16x slower than PQx8 training, because the number of centroids, and thus the number of training samples used, is proportionally higher.

IVF-based indices use 10 iterations of k-means by default (faiss/IndexIVF.h, faiss/IndexBinaryIVF.h, faiss/gpu/GpuIndexIVF.cu).

One may play with these numbers.
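
A hedged sketch of adjusting these from Python (the index type and the values are illustrative only):

import faiss

# Illustrative values: fewer k-means iterations and more aggressive
# subsampling make training faster at some cost in quantizer quality.
index = faiss.index_factory(64, "IVF4096,Flat")
index.cp.niter = 5                      # IVF default is 10
index.cp.max_points_per_centroid = 128  # default is 256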

Huge memory pages

It is typical for Faiss to use a random access pattern when reading the needed data, such as codebooks. As a result, the TLB cache operates under very heavy pressure. It was observed that using 2M / 1G huge memory pages helps in certain cases. In particular, 2M / 1G pages yielded up to a 20% speed increase in some of our vector codec experiments on the x86-64 platform. Unfortunately, Faiss does not contain built-in support for marking certain allocated memory blocks as ones where huge pages are important. So, the simplest way to try out huge memory pages in your application is to use a memory allocator that supports them, such as mimalloc.

We have not conducted any experiments with huge memory pages on the ARM platform yet.

The recipe: try a memory allocator that supports huge memory pages and run your application with them enabled. Please be aware that huge memory pages are a scarce resource and need to be managed carefully.

NUMA nodes

The core team is aware that Faiss does not support operation across multiple NUMA nodes. This means that Faiss does not have internal facilities that, for example, coordinate memory allocations in order to minimize traffic between NUMA nodes. As an example, we observed a case where running the same benchmark on a single 12C/24T NUMA node and on four NUMA nodes of the same machine yielded the same running time!

The recipe: try playing with the number of CPU resources that you allocate for your application (for example, numactl can restrict a process to a single NUMA node).

Intel MKL library

Faiss uses the Intel MKL library, if available. MKL uses the OpenMP library under the hood. According to the MKL manual, it runs best if the number of threads equals the number of physical cores.

The recipe: try setting the number of OpenMP threads used to the number of physical CPU cores, unless you're using a version of Faiss without any multithreading. So, on a 12C/24T machine, run your application as 'OMP_NUM_THREADS=12 my-application'.
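
The thread count can also be set programmatically; a minimal sketch (the value 12 assumes the 12C/24T machine above):

import faiss

# Set the number of OpenMP threads used by Faiss (and by MKL under it)
# to the number of physical cores; 12 matches the example machine above.
faiss.omp_set_num_threads(12)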

Batches

Faiss is optimized for working with batches of samples, rather than processing samples one by one.
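
A minimal sketch contrasting the two usage patterns (all sizes are arbitrary):

import faiss
import numpy as np

d = 64  # arbitrary sizes, just for illustration
xb = np.random.rand(100000, d).astype('float32')
xq = np.random.rand(1000, d).astype('float32')

index = faiss.IndexFlatL2(d)
index.add(xb)

# Preferred: one batched call over all the queries.
D, I = index.search(xq, 10)

# Slow: one query at a time, paying the per-call overhead 1000 times.
# for q in xq:
#     index.search(q[None, :], 10)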

The sizes of various buffers

If a huge dataset is used as an input, it is possible that the dataset will be processed in chunks inside Faiss. The overall processing can be sped up by increasing the chunk sizes, thus increasing the effectiveness of caches and linear algebra operations. On the other hand, this increases the amount of RAM needed for the allocated chunks and the time needed for those allocations. Here is a list of buffer sizes that one may vary (see the sketch after the list):

  • faiss::LocalSearchQuantizer.chunk_size
  • faiss::AdditiveQuantizer.max_mem_distances
  • faiss::multi_index_quantizer_search_bs
  • faiss::index2layer_sa_encode_bs
  • faiss::product_quantizer_compute_codes_bs

(todo: add others)
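
A hedged sketch of adjusting such sizes from Python; whether a particular symbol is exposed depends on the Faiss build, and global C++ variables are usually reachable via the SWIG globals object faiss.cvar:

import faiss

# Member buffer sizes are set directly on the object; the values are
# illustrative, not recommendations.
lsq = faiss.LocalSearchQuantizer(64, 8, 8)  # d=64, M=8 codebooks, 8 bits
lsq.chunk_size = 20000  # larger chunks: better throughput, more RAM

# Global batch sizes (the faiss::*_bs variables above) are exposed to
# Python through faiss.cvar, assuming the build exports them.
faiss.cvar.multi_index_quantizer_search_bs = 65536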

Intrinsics and AVX-512

Faiss uses certain custom kernels that involve the use of AVX2 or ARM NEON intrinsics. Currently, Faiss does not provide specialized AVX-512 kernels (except for faiss/utils/distances_fused/avx512.h, which needs to be enabled manually). Historically, there are three reasons for this:

  • AVX-512 computations lead to CPU downclocking, which typically results in lower performance (compared with the AVX2 versions).
  • AVX-512 kernels need developer time and dedication, and such kernels may be tricky to create and benchmark (perf, Intel VTune, uica.uops.info, etc.).
  • AVX-512 increases the compilation time. Also, Faiss may use the Intel MKL library, which is capable of using AVX-512 under the hood (if MKL decides to), and it is common that GEMM operations take most of the time.

As of today (Jan 2023):

  • AVX-512 downclocking seems to be fixed in newer generations of Intel CPUs.
  • AMD Zen4 seems to provide support for AVX-512 instructions.
  • bfloat16 instructions have come to AVX-512; these could be used for certain cases, such as Additive Quantizers, but such kernels need to be created and benchmarked first.

So, it is possible that Faiss will add some AVX-512 kernels in the future.

Faiss may use the Intel MKL library, which allows enabling or disabling certain instruction sets, for example via the MKL_ENABLE_INSTRUCTIONS environment variable (so, 'MKL_ENABLE_INSTRUCTIONS=AVX2 my-application'). It may be beneficial for your CPU to disable AVX-512 for MKL.

The hardware: CPU

(todo: add)

The hardware: GPU

(todo: add)

Speed vs accuracy: pick the index family

(todo: add)

Speed vs accuracy: approximate versions of algorithms

(todo: add about top-k heap)

Vector codecs

(todo: add)

Advanced: C++ templates

Faiss is implemented in C++ and heavily utilizes template dispatching for various cases. For example, the following function (faiss/utils/distances_simd.cpp) computes inner products between a single d-dimensional vector x and ny d-dimensional vectors y. As one may see, there are specialized cases for the dimensionalities 1, 2, 4, 8 and 12, which dispatch to custom kernels; otherwise, the implementation relies on general-purpose code and the compiler.

void fvec_inner_products_ny(
        float* dis,
        const float* x,
        const float* y,
        size_t d,
        size_t ny) {
#define DISPATCH(dval)                                  \
    case dval:                                          \
        fvec_op_ny_D##dval<ElementOpIP>(dis, x, y, ny); \
        return;

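    // Only these dimensionalities have specialized SIMD kernels; any
    // other value of d falls back to the reference implementation.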
    switch (d) {
        DISPATCH(1)
        DISPATCH(2)
        DISPATCH(4)
        DISPATCH(8)
        DISPATCH(12)
        default:
            fvec_inner_products_ny_ref(dis, x, y, d, ny);
            return;
    }
#undef DISPATCH
}

Custom specialized kernels tend to show better performance than general-purpose code, but at the cost of a larger binary size.

Please feel free to add some custom kernels that fit your case, if appropriate, and send us a pull request with your change! :)

Advanced: fused kernels

faiss/utils/distances_fused/distances_fused.h contains a fused kernel that combines the L2 distance computation and the nearest neighbor search.

Feedback

All feedback regarding Faiss execution speed is highly appreciated! The most valuable feedback is a reproducible standalone Python script or a small C++ program that makes it possible to detect hot spots inside Faiss.
