
Faiss GPU: improve error information for GPU OOM #3060

Open
wants to merge 1 commit into main

Conversation

wickedfoo
Contributor

Summary:
This diff updates logging for GPU out-of-memory errors, whether they come from `cudaMalloc` directly or from the RAFT allocator. On a memory error, the allocator state (including an indication of CUDA-reported free memory on the device) is returned as part of the exception message, like this:

```
C++ exception with description "Error in virtual void *faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest &) at fbcode/faiss/gpu/StandardGpuResources.cpp:570: StandardGpuResources: Faiss device allocator fail type IVFLists dev 1 space Device stream 0x7fa07623b440 size 1024 bytes
Allocator state:
GPU device 1 allocator state:
==========
Device free memory: 82400968704 bytes
Allocator temp memory remaining: 1610612720
Outstanding Faiss allocations:
Alloc type TemporaryMemoryBuffer: 1 allocations, 1610612736 bytes
Alloc type FlatData: 2 allocations, 59648 bytes
```
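
For context, here is a minimal C++ sketch of how a caller might surface this message. The index type and sizes are hypothetical; the only assumption is that a GPU OOM surfaces as a `faiss::FaissException` whose `what()` now carries the allocator dump:

```
#include <faiss/gpu/StandardGpuResources.h>
#include <faiss/gpu/GpuIndexIVFFlat.h>
#include <faiss/impl/FaissException.h>
#include <cstdio>

int main() {
    faiss::gpu::StandardGpuResources res;
    // Hypothetical index; any GPU index that allocates device memory works here.
    faiss::gpu::GpuIndexIVFFlat index(&res, /*dims=*/128, /*nlist=*/1024);
    try {
        // ... train/add a dataset large enough to exhaust device memory ...
    } catch (const faiss::FaissException& e) {
        // With this diff, the message includes the allocator state dump
        // (device free memory, temp memory remaining, outstanding allocations).
        std::fprintf(stderr, "%s\n", e.what());
    }
    return 0;
}
```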

Previously, when Faiss was built with RAFT, no error information was provided if the RAFT memory manager hit an OOM error; now it produces a string similar to the above. The Faiss memory manager (StandardGpuResources) continues to log all allocations made and passed through to the RAFT memory manager, so we also get an indication of what is allocated and for what purpose.

In addition, this fixes the issue where Faiss GPU would not compile (in fbcode at least) if the `USE_NVIDIA_RAFT` define was not available. Now the library compiles both with and without RAFT.

Also updated `#if defined USE_NVIDIA_RAFT` to `#ifdef USE_NVIDIA_RAFT` to better conform to the rest of the GPU code.
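
Concretely, the guard-style change is just the following (illustrative snippet, not the actual diff hunk):

```
// Before
#if defined USE_NVIDIA_RAFT
// RAFT-specific code path
#endif

// After, matching the prevailing style in the GPU code
#ifdef USE_NVIDIA_RAFT
// RAFT-specific code path
#endif
```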

This diff also disables the 1.5 GB temporary memory allocation made up front when RAFT is being used, since handling such allocations is really what the RAFT memory manager is intended for. Beyond that, this diff does not change the runtime behavior of Faiss GPU; it is being made to better debug GPU OOM issues with Faiss usage.
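
For reference, `StandardGpuResources` already exposes user-facing knobs for the same temporary pool; a minimal sketch (the 512 MB cap is an arbitrary example), separate from the automatic behavior this diff adds when RAFT is enabled:

```
#include <faiss/gpu/StandardGpuResources.h>

int main() {
    faiss::gpu::StandardGpuResources res;
    // Disable the up-front temporary memory stack entirely...
    res.noTempMemory();
    // ...or cap it at an explicit size (arbitrary 512 MB here).
    res.setTempMemory(512ull << 20);
    return 0;
}
```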

Reviewed By: mdouze

Differential Revision: D49260364
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D49260364

@wickedfoo
Contributor Author

@cjnolet FYI, with the Faiss/RAFT integration, when the RAFT memory manager was being used we were still reserving 1.5 GB of memory taken from the RAFT allocator and handing it out via the Faiss memory stack code, since the temporary memory management inside StandardGpuResources was still being used. I presume that the RAFT manager is also intended to allow efficient allocation of large chunks of temporary memory (e.g., grab 100 MB for temporary calculations, then immediately return it). So the RAFT manager was mainly being used only for permanent memory allocations (e.g., IVF list data and the like).

With this diff, if RAFT is enabled, we now just pass all allocation requests (temporary or "permanent") straight to the RAFT memory manager.

@wickedfoo
Contributor Author

@cjnolet What are your thoughts on this change to the memory allocator?

Currently in the repo, Faiss was (likely by accident) using both the Faiss and RAFT memory allocators. The Faiss allocator was used for "temporary" memory allocations (handed out from a 1.5 GB stack pre-allocated up front), while all non-temporary allocations (e.g., IVF list data) and overflow temporary allocations (where we asked for more temporary memory than was currently available) fell back to RAFT.

My concern is whether this would be a performance regression, e.g., if the RAFT allocator does not do well at handing out temporary 100+ MB chunks of memory that are quickly returned to the allocator (stream ordered, after dependent kernels are launched). With this diff, all allocations would instead go to RAFT.

Does the RAFT memory allocator handle very large (100-500 MB) but temporary allocations well (where we request an allocation and return it within the same C++ function, after launching 1-N kernels that use it), or is it more of a "small-ish block allocator" strategy (i.e., one that only works well with allocation churn for smaller allocations)?
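
For reference, the allocation pattern in question, sketched against RMM's stream-ordered `device_memory_resource` interface (the helper function and sizes are hypothetical):

```
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <cstddef>

// Hypothetical helper showing the pattern: grab a large scratch buffer,
// launch the kernels that need it, then return it on the same stream.
void run_step(rmm::cuda_stream_view stream, std::size_t scratch_bytes) {
    auto* mr = rmm::mr::get_current_device_resource();
    void* scratch = mr->allocate(scratch_bytes, stream);  // e.g. 100-500 MB
    // ... launch kernels on `stream` that read/write `scratch` ...
    mr->deallocate(scratch, scratch_bytes, stream);  // stream-ordered return
}
```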

@cjnolet
Contributor

cjnolet commented Oct 16, 2023

@wickedfoo,

I want to first draw a brief distinction between RMM (RAPIDS memory manager) and RAFT (primitives for vector search, information retrieval, and ML).

I think you are right about both FAISS and RMM being used at the same time. The RMM lead is out this week, but I'm going to schedule an information exchange session when he's back. In the meantime, I'll say that RMM itself is kind of the "only GPU memory manager you need" in the sense that its primary goals are to

  1. centralize device memory management across libraries, and
  2. allow user control over the centralized allocation strategy.

Number 2 is often done in the Python layer but could be done in the C++ layer as well.

RMM contains different strategies for allocating memory. The default strategy for device memory is a `device_memory_resource` that uses `cudaMalloc` under the hood. There's also a `pool_memory_resource`, which allocates a chunk of memory up front and hands out subsequent allocations from that pool. The benefit of RMM being shared across libraries is that the user can specify a `pool_memory_resource` (either bounded or unbounded) as their default resource, and all libraries that support RMM will allocate from that same pool.
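
A minimal sketch of that sharing, assuming RMM's C++ API (the 1 GB pool size is arbitrary):

```
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main() {
    // Upstream resource backed by cudaMalloc/cudaFree.
    rmm::mr::cuda_memory_resource cuda_mr;
    // Pool that reserves 1 GB up front and serves sub-allocations from it.
    rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
        &cuda_mr, 1ull << 30};
    // Every library that allocates through RMM now draws from this pool.
    rmm::mr::set_current_device_resource(&pool_mr);
    // ... build and query GPU indexes here ...
    return 0;
}
```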

RAFT has found that the default memory resource is usually good enough, but we've recently started to support a second allocation strategy for temporary workspace allocations that we call the `workspace_resource`. So far only a handful of algorithms support this, but it allows us to, for example, set a pool allocator for the more temporary buffers and perhaps use managed memory or just a regular CUDA allocator for the longer-term memory like storing original vectors or training datasets.

Does this sound like it will support what FAISS needs? I should mention that RMM is a very lightweight, header-only memory management library, so we could enable RMM in the GpuResources unconditionally and keep only RAFT conditional for now.

Does this answer your questions? Do you see any potential problems w/ the way RMM works or bad assumptions w.r.t the way StandardGpuResources was designed to work?
