Unified memory for device oversubscription #1175
Benchmark Ideas:
Then run that benchmark on
Probably also run a smaller subset of the benchmark(s) without unified memory enabled, to show that unified memory doesn't lose (much) performance normally (assuming we can get the hinting right while also supporting oversubscription; otherwise we might need to make unified memory and oversubscription separately opt-in/out).
Oversubscription support requires the above. As described in the CUDA 12.3 programming guide, SM 6.x increased addressing to use 49 bits, which allows for oversubscription (leaving room for the full 48-bit CPU address space plus GPU memory). SM 6.x also improves UVM in multi-GPU settings, with page faults between devices being supported.
Adding a CUDA managed memory implementation would enable oversubscription of GPU memory on some systems (Linux with Pascal or newer, though Volta+ might be a better choice for performance reasons).
This would (probably) need to be an additional implementation, rather than replacing the current unmanaged memory implementation, as not all target platforms (some Windows configurations) or all supported GPUs (pre-Pascal, some Tegras?) support UVM or UVM oversubscription, and this is only known at runtime.
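A minimal sketch of the runtime detection this would need, assuming we branch on device attributes at startup (the helper and surrounding structure are hypothetical; the two attributes are real CUDA runtime API values):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query whether the active device can support managed memory and
// oversubscription. cudaDevAttrConcurrentManagedAccess is 0 on Windows
// and on pre-Pascal GPUs, so this naturally selects the fallback there.
bool deviceSupportsUVMOversubscription(int device) {
    int managed = 0, concurrent = 0;
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    // Oversubscription relies on demand paging, which requires
    // concurrent managed access (Pascal+ on Linux).
    return managed && concurrent;
}

int main() {
    int device = 0;
    cudaGetDevice(&device);
    if (deviceSupportsUVMOversubscription(device)) {
        float *data = nullptr;
        // Managed allocations may exceed device memory when paging is available.
        cudaMallocManaged(&data, 1u << 20);
        cudaFree(data);
        printf("UVM oversubscription available\n");
    } else {
        printf("Fall back to the existing unmanaged implementation\n");
    }
    return 0;
}
```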
For systems with high host-device bandwidth (Grace-Hopper, V100 PPC64LE) the performance impact of this could be relatively small if prefetching is handled well when devices are not oversubscribed.
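The prefetch side could be a small wrapper around the standard UVM prefetch call, something like (the helper name and the skip-when-oversubscribed policy are assumptions, not existing code):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: move a managed buffer to the device ahead of a
// kernel launch so the kernel doesn't take page faults on first touch.
// When the device is oversubscribed, this could be skipped (or issued for
// only a subset of the data) to avoid evicting other resident pages.
void prefetchForKernel(void *managedPtr, size_t bytes, int device, cudaStream_t stream) {
    // cudaMemPrefetchAsync is the standard UVM prefetch entry point;
    // passing cudaCpuDeviceId instead of `device` prefetches back to host.
    cudaMemPrefetchAsync(managedPtr, bytes, device, stream);
}
```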
Optionally, we could also extend the API to allow users to opt in to higher performance by declaring what data is expected to be on the device for each agent function.
When oversubscribed, this would reduce the number of page faults, and also reduce the number of unnecessary prefetch host-to-device copies (due to device-to-host eviction when oversubscribed).
I.e. something along the lines of:
Without any `requires`, everything (or nothing) would be prefetched (this might need to be dependent on the GPU in use, and the current total size of all the data that would otherwise be prefetched). With the `requires` (or similar, it's a placeholder), only the requested agent variables would be prefetched, to avoid unnecessary host-device copies and prevent evicting old data. The same could be applied to input message lists, if only part of the message list is required.
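One hypothetical shape for this, sketched as a host-side declaration on the function description (none of these calls exist in the current API; `requires` is just the placeholder name from above):

```cuda
// Hypothetical opt-in: declare which agent variables an agent function
// touches, so only those are prefetched before launch. The requires()
// and requiresMessageInput() methods are invented for illustration.
flamegpu::AgentFunctionDescription &fn = agent.newFunction("move", move_fn);
fn.requires({"x", "y"});          // prefetch only the "x" and "y" variables
fn.requiresMessageInput(false);   // hypothetical: skip message-list prefetch
```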
Not sure if this would make sense for message outputs (messages are not mutable, so partial outputs don't make a huge amount of sense).
Agent birth could benefit too, but "default" variable initialisation off-device would need special handling.
Other data required by agent functions (curve etc.) would probably always need explicit prefetching so it doesn't get evicted.
It will also be worth investigating the use of grid/block-stride loops as a way to reduce the number of page faults and avoid the associated performance penalty / latency.
See https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/
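The grid-stride pattern from that post, applied to a trivial kernel (a standard CUDA technique, not project-specific code):

```cuda
// Grid-stride loop: launch a fixed, occupancy-sized grid, then have each
// thread walk the data with a stride of gridDim.x * blockDim.x. With
// managed memory this keeps resident threads reusing already-faulted
// pages, rather than one thread per element faulting everywhere at once.
__global__ void scale(float *data, size_t n, float factor) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Launched with a device-sized grid rather than n/blockSize blocks, e.g.:
//   scale<<<numSMs * blocksPerSM, 256>>>(data, n, 2.0f);
```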