
Unified memory for device oversubscription #1175

Open

ptheywood opened this issue Jan 22, 2024 · 3 comments

Comments


ptheywood commented Jan 22, 2024

A CUDA managed (unified) memory implementation would enable oversubscription of the GPU on some systems (Linux with Pascal or newer, though Volta or newer might be a better cut-off for performance reasons).

This would (probably) need to be an additional implementation, rather than replacing the current unmanaged memory implementation, as not all target platforms (some Windows configurations) or all supported GPUs (pre-Pascal, some Tegras?) support UVM or UVM oversubscription, and this is only known at runtime.
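The runtime check this implies can be done with the CUDA runtime attribute API; a minimal sketch (the function name is a placeholder, but the attributes are the real CUDA runtime enums):

```
#include <cuda_runtime.h>

// Placeholder helper: decide at runtime whether the managed-memory
// implementation (and oversubscription) can be selected for this device.
bool deviceSupportsManagedOversubscription(int device) {
    int managed = 0, concurrent = 0;
    // Basic managed memory support (Kepler or newer, 64-bit host).
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
    // Concurrent managed access implies Pascal-style on-demand page
    // migration, which is what oversubscription requires (Linux only).
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    return managed && concurrent;
}
```

This would let a single binary fall back to the existing unmanaged implementation on Windows or pre-Pascal hardware.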

For systems with high host-device bandwidth (Grace Hopper, V100 on PPC64LE) the performance impact of this could be relatively small if prefetching is handled well when devices are not oversubscribed.
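The prefetching referred to here would presumably use `cudaMemPrefetchAsync`; a hedged sketch of a single step, with the kernel launch and buffer names as placeholders:

```
#include <cuda_runtime.h>

// Placeholder sketch: with managed allocations, explicit prefetches take
// the place of the current explicit H2D/D2H copies.
void stepWithPrefetch(float *data, size_t bytes, int device, cudaStream_t stream) {
    // Migrate pages to the device ahead of the launch; cheap if already resident.
    cudaMemPrefetchAsync(data, bytes, device, stream);
    // kernel<<<grid, block, 0, stream>>>(data, ...);  // placeholder launch
    // If a host/step function reads the results next, prefetch them back:
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
}
```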

Optionally, we could also extend the API to allow users to opt in to higher performance by declaring which data is expected to be on the device for each agent function.
When oversubscribed, this would reduce the number of page faults, and also reduce the number of unnecessary prefetch H2D copies (due to D2H eviction when oversubscribed).

I.e. something along the lines of:

auto af = agent.newFunction("move", move);
af.requires("agent", "x");
af.requires("agent", "y");

Without any requires, everything (or nothing) would be prefetched (this might need to depend on the GPU in use, and the current total size of all the data that would otherwise be prefetched).

With the requires (or similar; it's a placeholder) only the requested agent variables would be prefetched, to avoid unnecessary host-device copies and to prevent evicting old data.
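A hypothetical sketch of the host-side bookkeeping behind such a `requires` API: each agent function records the (agent, variable) pairs it accesses, and the runtime prefetches only those buffers, falling back to everything when nothing was declared. All names here are placeholders, not the current FLAME GPU API (`requires` is a C++20 keyword, hence the trailing underscore).

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Placeholder: per-agent-function record of declared data requirements.
struct AgentFunctionRequirements {
    std::set<std::pair<std::string, std::string>> required;  // (agent, variable)
    void requires_(const std::string &agent, const std::string &var) {
        required.insert({agent, var});
    }
};

// Select which named device buffers to prefetch before launching the
// function. With no declarations, prefetch everything (the default above).
std::vector<std::string> buffersToPrefetch(
        const AgentFunctionRequirements &reqs,
        const std::map<std::pair<std::string, std::string>, std::string> &buffers) {
    std::vector<std::string> out;
    for (const auto &kv : buffers) {
        if (reqs.required.empty() || reqs.required.count(kv.first))
            out.push_back(kv.second);
    }
    return out;
}
```

Each selected buffer would then get a `cudaMemPrefetchAsync` call before the launch.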

The same could be applied to input message lists, if only part of the message list is required.
Not sure if this would make sense for message outputs (messages are not mutable, so partial outputs don't make much sense).
Agent birth could benefit too, but "default" variable initialisation off-device would need special handling.

Other data required by agent functions (Curve etc.) would probably always need explicit prefetching so it doesn't get evicted.


It will also be worth investigating the use of grid/block-stride loops during this, as a way to reduce the number of page faults and avoid the associated performance penalty / latency.
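For reference, a minimal grid-stride loop (the kernel itself is an illustrative placeholder, not FLAME GPU code): a fixed-size grid strides over the whole population, so pages are touched in a streaming pattern and far fewer faulting threads are in flight at once than with one thread per element.

```
// Placeholder kernel: each launched thread processes multiple elements.
__global__ void scale(float *x, size_t n, float a) {
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        x[i] *= a;
    }
}
```

The launch would size the grid to the device (e.g. a multiple of the SM count) rather than to `n`.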

See https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/

@ptheywood
Member Author

Benchmark Ideas:

  1. Scale population, simple comms-free model (i.e. random walk).
  2. Scale population, message output, message input with a fixed message count per agent (bucket, with 1 bucket per N agents; or 2 agent types, a small population that outputs and a big population that reads all messages from the small population).
  3. Scale population, message output, message input with message input scaling (brute force, or spatial with increasing density; brute force would be pretty painful for a population large enough to need oversubscription on a GH200 (96 or 144 GiB HBM3, up to 480 GB host memory I think; 900 GB/s host-device, ~4 TB/s on-device, ~500 GB/s on the host)).
  4. Populations that fit in device memory, but do something in a host/step function that means data gets moved about more, just to highlight the difference.

Then run that benchmark on

  1. Grace-Hopper
  2. Hopper (ideally SXM, but PCIe might have to do).
  3. Older GPUs (maybe V100, plus V100 on PPC64LE).

Probably also run a smaller subset of the benchmark(s) without unified memory enabled, to show that unified memory doesn't lose (much) performance normally (assuming we can get the hinting right while also supporting oversubscription; otherwise we might need to make unified memory and oversubscription separately opt-in/out).
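The "hinting" mentioned here would likely be built on the real `cudaMemAdvise` API; a sketch under the assumption that agent data stays device-resident between steps (buffer names are placeholders):

```
#include <cuda_runtime.h>

// Placeholder helper, not FLAME GPU API.
void applyResidencyHints(void *agent_data, size_t bytes, int device) {
    // Prefer keeping agent data on the GPU; it is only evicted under
    // memory pressure rather than on every CPU touch.
    cudaMemAdvise(agent_data, bytes, cudaMemAdviseSetPreferredLocation, device);
    // Pre-establish the device mapping so first access doesn't fault.
    cudaMemAdvise(agent_data, bytes, cudaMemAdviseSetAccessedBy, device);
}
```

Data written on the host once but read repeatedly by kernels could additionally use `cudaMemAdviseSetReadMostly`.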


mondus commented Jan 29, 2024

External benchmark use is evident at QMUL and internally on the RSE Blog. This issue will be shared to capture any feedback on further benchmark ideas that could measure host to device transfer.

@ptheywood
Member Author

Unified Memory has two basic requirements:

  • a GPU with SM architecture 3.0 or higher (Kepler class or newer)

  • a 64-bit host application and non-embedded operating system (Linux or Windows)

GPUs with SM architecture 6.x or higher (Pascal class or newer) provide additional Unified Memory features such as on-demand page migration and GPU memory oversubscription that are outlined throughout this document. Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher. See Data Migration and Coherency for details.

Oversubscription support requires the above, as described in the CUDA 12.3 programming guide.

SM 6.x increased GPU addressing to 49 bits, which allows for oversubscription (leaving room for the full 48-bit CPU virtual address space plus GPU memory).

SM 6.x also improves UVM in multi-GPU settings, with page faults between devices being supported.
