CUDA: Add support for compilation to LTO-IR #9274
gpuci run tests |
This is a bit clearer about what it does, and will be a more representative name when compilation to LTO-IR is also supported. We also rename the buffer held by `CompilationUnit.compile()`, because the buffer will no longer be limited to holding PTX only.
We never have multiple PTX outputs anymore (this was only necessary with NVVM 3.4), so there's no need to make lists of them or join them.
This addition explicitly states the behaviour when a return type is or is not supplied as part of the signature; previously the user would have had to guess this, or discover it through accident / experiment.
This follows a very similar process to PTX compilation - LTO generation is enabled with NVVM's `-gen-lto` flag.
We create a more generic function, `numba.cuda.compile()`, that provides similar functionality to `compile_ptx()` but allows the choice of PTX or LTO-IR output. This function defaults to the C ABI rather than the Numba one, as this is expected to be more convenient for most use cases. We also add a variant that targets the current device. The original `compile_ptx()` and its variant for the current device are kept to support existing use cases that rely on them and expect generated code to use the Numba ABI.
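A minimal usage sketch of the interface described above, assuming a Numba build that includes this change and a working CUDA toolkit. The names `axpy` and `compile_axpy` are illustrative only; the `output="ltoir"` value and the `(code, return_type)` result reflect this PR's description and should be checked against the released documentation before relying on them.

```python
def axpy(a, x, y):
    # Plain Python function to be compiled as a CUDA device function.
    return a * x + y


def compile_axpy(output="ptx"):
    # Import lazily so this module can be loaded on machines without CUDA.
    from numba import cuda, float32

    # Only argument types are given, so the return type is left to inference.
    # The ABI is not specified here; per this PR, compile() defaults to the C ABI.
    args = (float32, float32, float32)
    code, return_type = cuda.compile(axpy, args, device=True, output=output)
    return code, return_type
```

Calling `compile_axpy("ltoir")` would be expected to return LTO-IR rather than PTX, which could then be handed to an LTO-capable linker together with code from other languages.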
Previous commits added support for compiling Python functions to CUDA LTO-IR via the compilation interfaces. This commit adds stub code for supporting compilation of `@cuda.jit`-decorated functions to LTO-IR. The only functional change, unused in Numba at present, is that if the linker has LTO enabled, the CUDA codegen uses NVVM to generate LTO-IR instead of PTX, and passes that to the linker.

The `lto` attribute is added to the linker classes in `numba.cuda.cudadrv.driver` - this is always `False` for the built-in linkers, but a linker from pynvjitlink (or any other external linker, in theory) could set it to `True` to signal that LTO is enabled.

Some tests must be skipped if LTO is enabled, because it becomes difficult to use the functionality they test when LTO is enabled:

- Some inspect the PTX, which is difficult to do when LTO-IR is generated instead.
- Others check for exceptions, but the exception flags get optimized away by LTO because Numba fails to add them to the used list (see numba#9526).
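The linker-driven choice between PTX and LTO-IR can be sketched as follows. This is a simplified model, not the code in this commit: `BuiltinLinker`, `LtoCapableLinker` and `select_output_kind` are hypothetical names, and only the `lto` attribute itself comes from the description above.

```python
class BuiltinLinker:
    # Hypothetical stand-in: the built-in linkers always report lto = False.
    lto = False


class LtoCapableLinker:
    # Hypothetical stand-in for an external linker that has LTO enabled.
    lto = True


def select_output_kind(linker):
    # Assumption of this sketch: a linker without the attribute is treated
    # as if LTO were disabled.
    return "ltoir" if getattr(linker, "lto", False) else "ptx"
```

With this shape, the codegen does not need to know which linker implementation is in use; it only inspects the flag and generates the matching kind of output.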
This is needed to allow the "skip under LTO" test functionality to run successfully (and not skip on the simulator, since it does not simulate LTO).
- `compile_for_current_device()` needs an `output` kwarg so it can generate LTO-IR or PTX.
- `compile_ptx()` now calls `compile()` with an explicit `output` kwarg so that it compiles to PTX even if the default for `compile()` changes in future.
- `compile_ptx_for_current_device()` now calls `compile_ptx()` with the CC for the current device.
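The delegation between the four entry points listed above can be modelled roughly like this. Every function here is a stand-in with a `_sketch` suffix: the real functions take more parameters and invoke the compiler, `current_device_cc()` is a hypothetical helper returning a fixed value, and the `NotImplementedError` for an unknown output kind is an assumption about how such a request is rejected.

```python
def current_device_cc():
    # Hypothetical helper: a real implementation would query the active GPU.
    return (7, 5)


def compile_sketch(pyfunc, sig, cc=None, abi="c", output="ptx"):
    # Stand-in for the compiler: it only records what was requested.
    if output not in ("ptx", "ltoir"):
        raise NotImplementedError(f"Unsupported output type: {output}")
    return {"name": pyfunc.__name__, "sig": sig, "cc": cc,
            "abi": abi, "output": output}


def compile_for_current_device_sketch(pyfunc, sig, abi="c", output="ptx"):
    # Forwards `output` so that either PTX or LTO-IR can be requested.
    return compile_sketch(pyfunc, sig, cc=current_device_cc(),
                          abi=abi, output=output)


def compile_ptx_sketch(pyfunc, sig, cc=None, abi="numba"):
    # Passes output explicitly, so PTX is produced even if the default of
    # compile_sketch() changes later.
    return compile_sketch(pyfunc, sig, cc=cc, abi=abi, output="ptx")


def compile_ptx_for_current_device_sketch(pyfunc, sig, abi="numba"):
    # Reuses the PTX wrapper and only supplies the device's CC.
    return compile_ptx_sketch(pyfunc, sig, cc=current_device_cc(), abi=abi)
```

Routing the PTX-only wrappers through one explicit `output="ptx"` call keeps their behaviour stable for existing callers, independent of any future change to the generic function's defaults.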
This is just implementing `get_asm_str()` now, which is part of the codegen object's interface. To align better with the rest of Numba, the `_get_ptx()` body is moved into `get_asm_str()` and `get_asm_str()` is used in its place.
Thanks for the patch @gmarkall, it's great to see this feature implemented. This has been through an OOB pair review between us already, in which the feature, expectations and implementation were discussed. The review provided below just catches a few small things in the resulting change set; the contents are otherwise good. Thanks again for working on this!
- Wording edits to docs on CUDA compilation.
- Check for `if cc is not None` rather than just `if cc`, etc., in the codegen, for greater robustness.
- Add a test that checks the error reported when specifying an illegal output kind.
- Cross-reference numba#9526 in the comment in `TestUserUxc`.
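The difference between the two styles of check mentioned in the list above can be illustrated with a small, self-contained example. The functions and `DEFAULT_CC` are hypothetical and are not taken from the Numba codegen; they only show why an explicit `None` test is more robust than a truthiness test.

```python
DEFAULT_CC = (5, 0)  # illustrative default, not Numba's actual value


def resolve_cc_truthy(cc=None):
    # Fragile: any falsy value, such as an empty tuple, silently becomes
    # the default instead of being reported as a mistake.
    return cc if cc else DEFAULT_CC


def resolve_cc_explicit(cc=None):
    # Robust: only a missing argument selects the default; anything else
    # is validated and used as given.
    if cc is not None:
        if len(cc) != 2:
            raise ValueError(f"cc must be a (major, minor) pair, got {cc!r}")
        return tuple(cc)
    return DEFAULT_CC
```

Here `resolve_cc_truthy(())` quietly falls back to the default, whereas `resolve_cc_explicit(())` raises, so a malformed argument is surfaced to the caller.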
@stuartarchibald Many thanks for the review - I believe all comments are now addressed, and I'm just waiting on CI.
Thanks for the patch and fixes!
This adds support for compiling to LTO-IR, providing an alternative to PTX for linking with code written in languages other than Python. LTO-IR offers greater potential for optimization, because the linker can optimize at link time across the whole body of source, regardless of the language each part was written in.
A summary of the changes: