Workaround for bad GPUDirect performance with unaligned GPU buffers #1143

Draft · wants to merge 1 commit into master
Conversation

msimberg (Collaborator) commented:
Opening this as a draft for reference. I think we should wait for responses from both the Umpire developers (LLNL/Umpire#881) and HPE before deciding whether to apply a workaround, and which one. This typically, though not always, gives reasonable performance after only one warmup iteration, and the warmup iteration isn't unreasonably slow compared to the best case. However, it always allocates at least 2 MiB per allocation from Umpire, which can waste quite a lot of memory for small tiles. As an example, the gen_to_std miniapp can look like this on current master:

[0] 17.3253s 495.804GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[1] 11.5633s 742.859GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[2] 2.86979s 2993.22GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[3] 0.0939851s 91396.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[4] 2.95547s 2906.45GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[5] 0.0937317s 91643.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[6] 0.0919855s 93383.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[7] 0.0930948s 92270.8GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[8] 0.0933742s 91994.7GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[9] 0.0922234s 93142.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU

and most of the time looks like this on this PR:

[0] 0.318221s 26993.7GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[1] 0.0949778s 90441.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[2] 0.0906252s 94785.2GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[3] 0.0963228s 89178.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[4] 0.0931526s 92213.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[5] 0.0924757s 92888.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[6] 0.0923647s 93000.2GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[7] 0.09494s 90477.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[8] 0.091092s 94299.5GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU
[9] 0.091955s 93414.6GFlop/s dL (20480, 20480) (1024, 1024) (2, 2) 72 GPU

The best case doesn't improve, but the worst case and the variance improve significantly.
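
To make the memory overhead concrete, here is a small standalone sketch. It only mirrors the "at least 2 MiB per allocation" behaviour described above (rounding each request up to the alignment), not Umpire's actual bookkeeping:

#include <cstddef>
#include <cstdio>

int main() {
  // With a pool alignment of 2 MiB, every allocation is rounded up to a
  // multiple of 2 MiB, so small tiles waste most of their slot.
  constexpr std::size_t alignment = std::size_t(1) << 21;  // 2 MiB
  constexpr std::size_t tile_bytes[] = {
      64 * 1024,        // a small tile
      1024 * 1024,      // 1 MiB
      8 * 1024 * 1024,  // a 1024x1024 tile of doubles, as in the runs above
  };
  for (std::size_t bytes : tile_bytes) {
    const std::size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    std::printf("requested %9zu B -> pool consumes %9zu B (%5.1f%% wasted)\n",
                bytes, rounded, 100.0 * (rounded - bytes) / rounded);
  }
}

A 64 KiB tile wastes about 97% of its 2 MiB slot, while a 1024x1024 tile of doubles is exactly 8 MiB and wastes nothing.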

@@ -71,7 +71,7 @@ void initializeUmpireDeviceAllocator(std::size_t initial_bytes) {
   auto device_allocator = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
   auto pooled_device_allocator =
       umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
-          "DEVICE_pool", device_allocator, initial_bytes);
+          "DEVICE_pool", device_allocator, initial_bytes, std::size_t(1) << 30, std::size_t(1) << 21);
msimberg (Collaborator, author) commented on the diff:

TODO: no magic numbers (see the named-constant sketch after this list). Currently they mean:

  • When growing the pool after the initial allocation, always grow by at least 1 GiB.
  • The alignment is 2 MiB (the large page size). Using the regular page size (64 KiB) for alignment improves performance, but doesn't completely remove the strange slowdowns.
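
A minimal sketch of what that cleanup could look like, assuming the same QuickPool parameters as in the diff above; the constant names are hypothetical and not part of this PR:

#include <cstddef>

#include <umpire/ResourceManager.hpp>
#include <umpire/strategy/QuickPool.hpp>

void initializeUmpireDeviceAllocator(std::size_t initial_bytes) {
  // Grow the pool by at least 1 GiB after the initial allocation.
  constexpr std::size_t next_minimum_pool_allocation = std::size_t(1) << 30;
  // Align (and round up) every allocation to the 2 MiB large page size.
  constexpr std::size_t pool_alignment = std::size_t(1) << 21;

  auto device_allocator = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
  auto pooled_device_allocator =
      umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
          "DEVICE_pool", device_allocator, initial_bytes, next_minimum_pool_allocation,
          pool_alignment);
  (void)pooled_device_allocator;  // silence unused-variable warnings in this standalone sketch
}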

@@ -45,7 +45,7 @@ void initializeUmpireHostAllocator(std::size_t initial_bytes) {
   auto pooled_host_allocator =
       umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>("PINNED_pool",
                                                                                         host_allocator,
-                                                                                        initial_bytes);
+                                                                                        initial_bytes, std::size_t(1) << 30, std::size_t(1) << 21);
msimberg (Collaborator, author) commented on the diff:

Maybe don't apply this to the host pool?
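
For reference, if the change were kept device-only (just illustrating the question above, not a decision), the host pool would simply keep Umpire's default QuickPool parameters, as on master:

auto pooled_host_allocator =
    umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
        "PINNED_pool", host_allocator, initial_bytes);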
