Avoid overallocation when underlying allocation is guaranteed to be sufficiently aligned #881

msimberg · 2024-05-15T11:33:58Z

Is your feature request related to a problem? Please describe.

The underlying allocator may have sufficient alignment, but aligned_allocate always overallocates to guarantee the alignment, even if it may not be necessary:

Umpire/src/umpire/strategy/mixins/AlignedAllocation.inl

Line 25 in 45159e8

std::size_t total_bytes{ size + m_alignment };

This wastes (a bit of) memory, and may cause performance issues with some MPI libraries.

Describe the solution you'd like

Avoid overallocation if the underlying allocator provides sufficiently aligned allocations.
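To make the request concrete, here is a minimal sketch of the size computation I have in mind. `bytes_to_request` and the `guaranteed_alignment` parameter are hypothetical names for illustration, not existing Umpire API; how Umpire would learn what alignment the underlying allocator guarantees is left open.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical check (not Umpire API): a guaranteed alignment is sufficient
// when it is a non-zero multiple of the requested alignment.
constexpr bool alignment_is_sufficient(std::size_t guaranteed_alignment,
                                       std::size_t requested_alignment)
{
  return requested_alignment != 0 && guaranteed_alignment != 0 &&
         guaranteed_alignment % requested_alignment == 0;
}

// Hypothetical replacement for the `size + m_alignment` computation:
// only pad the request when the underlying allocator cannot promise
// enough alignment on its own.
constexpr std::size_t bytes_to_request(std::size_t size,
                                       std::size_t requested_alignment,
                                       std::size_t guaranteed_alignment)
{
  return alignment_is_sufficient(guaranteed_alignment, requested_alignment)
             ? size                         // already aligned, no padding
             : size + requested_alignment;  // current behaviour
}

// Example values only: 256 stands in for whatever the backend guarantees.
static_assert(bytes_to_request(1024, 16, 256) == 1024, "no padding needed");
static_assert(bytes_to_request(1024, 16, 8) == 1040, "padding still needed");
```

With a check like this, a 1 GiB request on a backend that guarantees at least 16-byte alignment would stay exactly 1 GiB.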

Describe alternatives you've considered

Allow controlling alignment of backing buffers separately from alignment of user-facing allocations. The latter should probably never be larger than the former.

Additional context

This is really a feature request that comes from investigating what may be a bug in Cray MPICH, but I wanted to report it here as well since I think Umpire could in some situations do a better job (or I'm simply unaware of the knobs that Umpire has for controlling this, so looking for input in any case).

In our application we use Umpire's QuickPool to pool allocations of GPU buffers. QuickPool uses aligned_allocate to allocate backing buffers from e.g. CUDA, but if I ask for a 1 GiB buffer, QuickPool will allocate 1 GiB plus the alignment (16 by default) to guarantee that the allocation is aligned. It turns out that with GPU-aware MPI, communicating a buffer whose size isn't page-aligned (I think this is the requirement, but I'm still looking into the details) makes performance drop considerably. I'm separately reporting this issue to HPE.
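The arithmetic behind this is small enough to check at compile time. The sketch below assumes a 4096-byte page purely for illustration (the actual granularity that matters to the MPI library is still unknown to me); the 16-byte padding is the default alignment mentioned above.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t one_gib = std::size_t{1} << 30;
constexpr std::size_t default_alignment = 16;  // QuickPool default, per the text
constexpr std::size_t assumed_page_size = 4096;  // assumption, for illustration

// True when `bytes` is a whole number of `granularity`-sized units.
constexpr bool is_multiple_of(std::size_t bytes, std::size_t granularity)
{
  return granularity != 0 && bytes % granularity == 0;
}

// The requested size is page-sized, the padded backing buffer is not.
static_assert(is_multiple_of(one_gib, assumed_page_size),
              "1 GiB is a multiple of the page size");
static_assert(!is_multiple_of(one_gib + default_alignment, assumed_page_size),
              "1 GiB + 16 is not a multiple of the page size");
```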

I could set the alignment of the QuickPool to the page size to get an appropriately sized backing buffer, but if I understand correctly then all allocations on top of that will also have page-sized alignment, which is excessive for small allocations and can end up wasting a lot of memory. From what I can tell DynamicPoolList behaves the same as QuickPool (is there a reason to prefer one or the other by the way?).
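To illustrate why page-sized alignment is excessive for small allocations, here is a rough estimate under the assumption that every allocation is rounded up to a multiple of the pool alignment. `round_up_to` and `reserved_bytes` are hypothetical helpers written for this example, and 4096 is again an assumed page size.

```cpp
#include <cassert>
#include <cstddef>

// Round `bytes` up to the next multiple of `alignment` (alignment > 0).
constexpr std::size_t round_up_to(std::size_t bytes, std::size_t alignment)
{
  return (bytes + alignment - 1) / alignment * alignment;
}

// Bytes reserved in the pool for `count` allocations of `bytes` each,
// assuming each one is rounded up to the pool alignment.
constexpr std::size_t reserved_bytes(std::size_t count, std::size_t bytes,
                                     std::size_t alignment)
{
  return count * round_up_to(bytes, alignment);
}

// 1000 allocations of 64 bytes each:
static_assert(reserved_bytes(1000, 64, 16) == 64000,
              "16-byte alignment adds no padding here");
static_assert(reserved_bytes(1000, 64, 4096) == 4096000,
              "page-sized alignment reserves 64x more");
```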

Is there a way already to control the alignment of the backing buffers and "real" allocations on top of it separately? Is there another pool that we could use to get the behaviour we want?

Just out of curiosity, since I couldn't find it: where is the code for ensuring that a QuickPool allocation starts at the correct alignment? I see the size is adjusted here:

Umpire/src/umpire/strategy/QuickPool.cpp

Line 46 in 45159e8

const std::size_t rounded_bytes{aligned_round_up(bytes)};

Edit: I realize this probably happens by construction. If the backing buffers have sufficient alignment and all the allocations have aligned sizes they'll be guaranteed to start aligned as well.
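A small sketch of that "by construction" argument, assuming blocks are carved back to back from an aligned base and every size is rounded up to a multiple of the alignment. `round_up` and `carve_offsets` are hypothetical helpers for this example and are not taken from QuickPool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `bytes` up to the next multiple of `alignment` (alignment > 0).
constexpr std::size_t round_up(std::size_t bytes, std::size_t alignment)
{
  return (bytes + alignment - 1) / alignment * alignment;
}

// Carve consecutive blocks out of a buffer whose base (offset 0) is aligned,
// rounding each size up, and return the start offset of every block.
inline std::vector<std::size_t> carve_offsets(
    const std::vector<std::size_t>& sizes, std::size_t alignment)
{
  std::vector<std::size_t> offsets;
  std::size_t cursor = 0;
  for (std::size_t size : sizes) {
    offsets.push_back(cursor);            // block starts at the current cursor
    cursor += round_up(size, alignment);  // cursor stays a multiple of alignment
  }
  return offsets;
}
```

Since the cursor starts at 0 and only ever grows by multiples of the alignment, every offset is itself a multiple of the alignment, so no per-allocation pointer adjustment is needed.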

Thanks for your help!


msimberg mentioned this issue May 17, 2024

Workaround for bad GPUDirect performance with unaligned GPU buffers eth-cscs/DLA-Future#1143

Draft
