
Unified memory for device oversubscription #1175

Open

ptheywood opened this issue Jan 22, 2024 · 3 comments

Comments


ptheywood commented Jan 22, 2024

A CUDA managed (unified) memory implementation would enable oversubscription of the GPU on some systems (Linux with Pascal or newer, though Volta or newer might be a better cut-off for performance reasons).

This would (probably) need to be an additional implementation, rather than replacing the current unmanaged memory implementation, as not all target platforms (some Windows configurations) or all supported GPUs (pre-Pascal, some Tegras?) support UVM or UVM oversubscription, and this is only known at runtime.
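The runtime check this implies can be done with the CUDA runtime attribute API; a minimal sketch (the function name is a placeholder, but the attributes are the real CUDA runtime enums):

```
#include <cuda_runtime.h>

// Placeholder helper: decide at runtime whether the managed-memory
// implementation (and oversubscription) can be selected for this device.
bool deviceSupportsManagedOversubscription(int device) {
    int managed = 0, concurrent = 0;
    // Basic managed memory support (Kepler or newer, 64-bit host).
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
    // Concurrent managed access implies Pascal-style on-demand page
    // migration, which is what oversubscription requires (Linux only).
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    return managed && concurrent;
}
```

This would let a single binary fall back to the existing unmanaged implementation on Windows or pre-Pascal hardware.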

For systems with high host-device bandwidth (Grace Hopper, V100 on PPC64LE) the performance impact of this could be relatively small if prefetching is handled well when devices are not oversubscribed.
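The prefetching referred to here would presumably use `cudaMemPrefetchAsync`; a hedged sketch of a single step, with the kernel launch and buffer names as placeholders:

```
#include <cuda_runtime.h>

// Placeholder sketch: with managed allocations, explicit prefetches take
// the place of the current explicit H2D/D2H copies.
void stepWithPrefetch(float *data, size_t bytes, int device, cudaStream_t stream) {
    // Migrate pages to the device ahead of the launch; cheap if already resident.
    cudaMemPrefetchAsync(data, bytes, device, stream);
    // kernel<<<grid, block, 0, stream>>>(data, ...);  // placeholder launch
    // If a host/step function reads the results next, prefetch them back:
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
}
```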

Optionally, we could also extend the API to allow users to opt in to higher performance by declaring which data is expected to be on the device for each agent function.
When oversubscribed, this would reduce the number of page faults, and also reduce the number of unnecessary prefetch H2D copies (due to D2H eviction when oversubscribed).

I.e. something along the lines of:

auto af = agent.newFunction("move", move);
af.requires("agent", "x");
af.requires("agent", "y");

Without any requires, everything (or nothing) would be prefetched (this might need to depend on the GPU in use, and the current total size of all the data that would otherwise be prefetched).

With the requires (or similar; it's a placeholder) only the requested agent variables would be prefetched, to avoid unnecessary host-device copies and to prevent evicting old data.
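A hypothetical sketch of the host-side bookkeeping behind such a `requires` API: each agent function records the (agent, variable) pairs it accesses, and the runtime prefetches only those buffers, falling back to everything when nothing was declared. All names here are placeholders, not the current FLAME GPU API (`requires` is a C++20 keyword, hence the trailing underscore).

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Placeholder: per-agent-function record of declared data requirements.
struct AgentFunctionRequirements {
    std::set<std::pair<std::string, std::string>> required;  // (agent, variable)
    void requires_(const std::string &agent, const std::string &var) {
        required.insert({agent, var});
    }
};

// Select which named device buffers to prefetch before launching the
// function. With no declarations, prefetch everything (the default above).
std::vector<std::string> buffersToPrefetch(
        const AgentFunctionRequirements &reqs,
        const std::map<std::pair<std::string, std::string>, std::string> &buffers) {
    std::vector<std::string> out;
    for (const auto &kv : buffers) {
        if (reqs.required.empty() || reqs.required.count(kv.first))
            out.push_back(kv.second);
    }
    return out;
}
```

Each selected buffer would then get a `cudaMemPrefetchAsync` call before the launch.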

The same could be applied to input message lists, if only part of the message list is required.
Not sure if this would make sense for message outputs (messages are not mutable, so partial outputs don't make much sense).
Agent birth could benefit too, but "default" variable initialisation off-device would need special handling.

Other data required by agent functions (Curve etc.) would probably always need explicit prefetching so it doesn't get evicted.


It will also be worth investigating the use of grid/block-stride loops during this, as a way to reduce the number of page faults and avoid the associated performance penalty / latency.
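For reference, a minimal grid-stride loop (the kernel itself is an illustrative placeholder, not FLAME GPU code): a fixed-size grid strides over the whole population, so pages are touched in a streaming pattern and far fewer faulting threads are in flight at once than with one thread per element.

```
// Placeholder kernel: each launched thread processes multiple elements.
__global__ void scale(float *x, size_t n, float a) {
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        x[i] *= a;
    }
}
```

The launch would size the grid to the device (e.g. a multiple of the SM count) rather than to `n`.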

See https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/

@ptheywood
Member Author

Benchmark Ideas:

  1. Scale population, simple comms-free model (i.e. random walk).
  2. Scale population, message output, message input with a fixed message count per agent (bucket, with 1 bucket per N agents; or 2 agent types, a small population that outputs and a big population that reads all messages from the small population).
  3. Scale population, message output, message input with message input scaling (brute force, or spatial with increasing density; brute force would be pretty painful for a population large enough to need oversubscription on a GH200 (96 or 144 GiB HBM3, up to 480 GB host memory I think; 900 GB/s host-device, ~4 TB/s on-device, ~500 GB/s on the host)).
  4. Populations that fit in device memory, but do something in a host/step function that means data gets moved about more, just to highlight the difference.

Then run that benchmark on

  1. Grace-Hopper
  2. Hopper (ideally SXM, but PCIe might have to do).
  3. Older GPUs (maybe V100, plus V100 on PPC64LE).

Probably also run a smaller subset of the benchmark(s) without unified memory enabled, to show that unified memory doesn't lose (much) performance normally (assuming we can get the hinting right while also supporting oversubscription; otherwise we might need to make unified memory and oversubscription separately opt-in/out).
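The "hinting" mentioned here would likely be built on the real `cudaMemAdvise` API; a sketch under the assumption that agent data stays device-resident between steps (buffer names are placeholders):

```
#include <cuda_runtime.h>

// Placeholder helper, not FLAME GPU API.
void applyResidencyHints(void *agent_data, size_t bytes, int device) {
    // Prefer keeping agent data on the GPU; it is only evicted under
    // memory pressure rather than on every CPU touch.
    cudaMemAdvise(agent_data, bytes, cudaMemAdviseSetPreferredLocation, device);
    // Pre-establish the device mapping so first access doesn't fault.
    cudaMemAdvise(agent_data, bytes, cudaMemAdviseSetAccessedBy, device);
}
```

Data written on the host once but read repeatedly by kernels could additionally use `cudaMemAdviseSetReadMostly`.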


mondus commented Jan 29, 2024

External benchmark use is evident at QMUL and internally on the RSE Blog. This issue will be shared to capture any feedback on further benchmark ideas that could measure host to device transfer.

@ptheywood
Member Author

Unified Memory has two basic requirements:

  • a GPU with SM architecture 3.0 or higher (Kepler class or newer)

  • a 64-bit host application and non-embedded operating system (Linux or Windows)

GPUs with SM architecture 6.x or higher (Pascal class or newer) provide additional Unified Memory features such as on-demand page migration and GPU memory oversubscription that are outlined throughout this document. Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher. See Data Migration and Coherency for details.

Oversubscription support requires the above, as described in the CUDA 12.3 programming guide.

SM 6.x increased GPU addressing to 49 bits, which allows for oversubscription (leaving room for the full 48-bit CPU virtual address space plus GPU memory).

SM 6.x also improves UVM in multi-GPU settings, with page faults between devices being supported.
