Workaround for bad GPUDirect performance with unaligned GPU buffers #1143

Draft · wants to merge 1 commit into master
Conversation

msimberg (Collaborator) commented:
Opening this as a draft for reference. I think we should wait for responses from both the Umpire developers (LLNL/Umpire#881) and HPE before deciding whether to apply a workaround, and which one. This typically, though not always, gives reasonable performance after only one warmup iteration, and the warmup iteration isn't unreasonably slow compared to the best case. However, it always allocates at least 2 MiB per allocation from Umpire, which can waste quite a lot of memory for small tiles. As an example, the gen_to_std miniapp can look like this on current master:

[0] 17.3253s 495.804GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[1] 11.5633s 742.859GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[2] 2.86979s 2993.22GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[3] 0.0939851s 91396.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[4] 2.95547s 2906.45GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[5] 0.0937317s 91643.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[6] 0.0919855s 93383.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[7] 0.0930948s 92270.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[8] 0.0933742s 91994.7GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[9] 0.0922234s 93142.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU

and most of the time looks like this on this PR:

[0] 0.318221s 26993.7GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[1] 0.0949778s 90441.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[2] 0.0906252s 94785.2GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[3] 0.0963228s 89178.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[4] 0.0931526s 92213.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[5] 0.0924757s 92888.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[6] 0.0923647s 93000.2GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[7] 0.09494s 90477.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[8] 0.091092s 94299.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[9] 0.091955s 93414.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU

The best case doesn't improve, but the worst case and the variance improve significantly.
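
To make the memory overhead concrete, here is a small standalone sketch. It only mirrors the "at least 2 MiB per allocation" behaviour described above (rounding each request up to the alignment), not Umpire's actual bookkeeping:

#include <cstddef>
#include <cstdio>

int main() {
  // With a pool alignment of 2 MiB, every allocation is rounded up to a
  // multiple of 2 MiB, so small tiles waste most of their slot.
  constexpr std::size_t alignment = std::size_t(1) << 21;  // 2 MiB
  constexpr std::size_t tile_bytes[] = {
      64 * 1024,        // a small tile
      1024 * 1024,      // 1 MiB
      8 * 1024 * 1024,  // a 1024x1024 tile of doubles, as in the runs above
  };
  for (std::size_t bytes : tile_bytes) {
    const std::size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    std::printf("requested %9zu B -> pool consumes %9zu B (%5.1f%% wasted)\n",
                bytes, rounded, 100.0 * (rounded - bytes) / rounded);
  }
}

A 64 KiB tile wastes about 97% of its 2 MiB slot, while a 1024x1024 tile of doubles is exactly 8 MiB and wastes nothing.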

@@ -71,7 +71,7 @@ void initializeUmpireDeviceAllocator(std::size_t initial_bytes) {
   auto device_allocator = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
   auto pooled_device_allocator =
       umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
-          "DEVICE_pool", device_allocator, initial_bytes);
+          "DEVICE_pool", device_allocator, initial_bytes, std::size_t(1) << 30, std::size_t(1) << 21);
msimberg (Collaborator, author) commented on the diff:

TODO: no magic numbers (see the named-constant sketch after this list). Currently they mean:

  • When growing the pool after the initial allocation, always grow by at least 1 GiB.
  • The alignment is 2 MiB (the large page size). Using the regular page size (64 KiB) for alignment improves performance, but doesn't completely remove the strange slowdowns.
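
A minimal sketch of what that cleanup could look like, assuming the same QuickPool parameters as in the diff above; the constant names are hypothetical and not part of this PR:

#include <cstddef>

#include <umpire/ResourceManager.hpp>
#include <umpire/strategy/QuickPool.hpp>

void initializeUmpireDeviceAllocator(std::size_t initial_bytes) {
  // Grow the pool by at least 1 GiB after the initial allocation.
  constexpr std::size_t next_minimum_pool_allocation = std::size_t(1) << 30;
  // Align (and round up) every allocation to the 2 MiB large page size.
  constexpr std::size_t pool_alignment = std::size_t(1) << 21;

  auto device_allocator = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
  auto pooled_device_allocator =
      umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
          "DEVICE_pool", device_allocator, initial_bytes, next_minimum_pool_allocation,
          pool_alignment);
  (void)pooled_device_allocator;  // silence unused-variable warnings in this standalone sketch
}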

@@ -45,7 +45,7 @@ void initializeUmpireHostAllocator(std::size_t initial_bytes) {
   auto pooled_host_allocator =
       umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>("PINNED_pool",
                                                                                         host_allocator,
-                                                                                        initial_bytes);
+                                                                                        initial_bytes, std::size_t(1) << 30, std::size_t(1) << 21);
msimberg (Collaborator, author) commented on the diff:

Maybe don't apply this to the host pool?
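
For reference, if the change were kept device-only (just illustrating the question above, not a decision), the host pool would simply keep Umpire's default QuickPool parameters, as on master:

auto pooled_host_allocator =
    umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
        "PINNED_pool", host_allocator, initial_bytes);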
