CUDAGraphs in Torch-TRT
TL;DR
Enable CUDAGraphs model acceleration in Torch-TRT to enhance performance by hiding kernel launch time bottlenecks. CUDAGraphs support is enabled via a boolean compile argument. Some considerations remain regarding dynamic shapes.
Goal(s)
Boost performance via the use of CUDAGraphs to group kernels into units, thereby improving kernel launch times and decreasing overhead.
Use Cases
Proposed APIs / UX
The API would be invoked via an argument to torch_tensorrt.compile, as follows:
torch_tensorrt.compile(..., cudagraphs=True, ...)
Then, the backend would attempt to compile the graph with the specified input shapes using CUDAGraphs. This argument provides a layer of abstraction to the user: the background work of CUDAGraphs is handled by Torch-TensorRT, and the acceleration benefits are immediate.
Limitations
This feature will not work with arbitrary dynamic-shape inputs without user-provided padding, due to memory and recompilation limitations. Some consideration is needed to define a mechanism for handling dynamic-shape models, including when to dispatch to the CUDAGraphs implementation versus falling back to eager execution.
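The dispatch decision described above could be sketched as follows. This is a minimal illustration, not the proposed implementation; `recorded_shapes`, `run_cudagraph`, and `run_eager` are hypothetical names introduced here for clarity:

```python
# Sketch: replay the recorded CUDAGraph only on an exact shape match, since a
# new dynamic shape would otherwise read/write the wrong static buffers.
# All names below are illustrative assumptions, not Torch-TRT API.
from typing import Callable, Sequence, Tuple

def dispatch(input_shapes: Sequence[Tuple[int, ...]],
             recorded_shapes: Sequence[Tuple[int, ...]],
             run_cudagraph: Callable[[], object],
             run_eager: Callable[[], object]) -> object:
    if list(input_shapes) == list(recorded_shapes):
        return run_cudagraph()
    # Fallback path: execute eagerly for shapes the graph was not recorded for.
    return run_eager()
```

A real implementation would also need to decide whether a mismatched shape should trigger re-recording rather than a permanent eager fallback.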
Internal Implementation
Design
The key requirements of this feature are an instantiation, building, and storage mechanism for the CUDAGraphs objects.
Extensions Required to Core API implementations
The main APIs needing to change are the TRT runtime modules. There are a few candidate options for modifications:
1. All Recording in Python
Recording graph operations can be done after TRT engine building, as a wrapper over the inference execution, using one of several higher-level Python recording APIs.
One detail that is still unclear about this approach is whether recording the completed TRT engine will capture kernel artifacts that are undesired in the end result.
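One way Approach 1 could look, sketched with PyTorch's public torch.cuda.CUDAGraph API. This assumes a CUDA device and a compiled `trt_module` with a single fixed-shape input; the function name, signature, and warmup policy are illustrative assumptions, not the proposed design:

```python
# Approach-1 sketch: record the compiled module's forward pass as a CUDAGraph
# and replay it on later calls. Assumes CUDA is available and input shapes
# are static; `trt_module` is a stand-in for the compiled TRT runtime module.
import torch

def record_wrapper(trt_module, sample_input: torch.Tensor):
    static_input = sample_input.clone()

    # Warm up on a side stream before capture, as PyTorch recommends.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        trt_module(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into the graph; outputs alias static buffers.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = trt_module(static_input)

    def run(new_input: torch.Tensor) -> torch.Tensor:
        # Copy into the captured input buffer, then replay the recorded kernels.
        static_input.copy_(new_input)
        graph.replay()
        return static_output

    return run
```

Because replay reuses the captured buffers, callers must copy inputs in (and, if needed, clone outputs out) rather than passing fresh tensors, which is exactly the stream/memory-management concern noted below.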
2. Recording at Engine Build Time
As recommended in the TensorRT developer guide (linked below), the execute_async call can be recorded to capture the sequence of kernels run in a narrow and precise scope. This is likely the better approach, though it would require more effort, since it needs C++ modifications for the TorchTensorRTModule and Python modifications for the PythonTorchTensorRTModule.
Snippet adapted from https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-graphs

```python
from cuda import cudart

err, stream = cudart.cudaStreamCreate()

# Call execute_async_v3() once after an input shape change to update internal state.
context.execute_async_v3(stream)

# Capture a CUDA graph instance
cudart.cudaStreamBeginCapture(stream, cudart.cudaStreamCaptureModeGlobal)
context.execute_async_v3(stream)
err, graph = cudart.cudaStreamEndCapture(stream)
err, instance = cudart.cudaGraphInstantiate(graph, 0)

# To run inference, launch the graph instead of calling execute_async_v3().
for i in range(iterations):
    cudart.cudaGraphLaunch(instance, stream)
    cudart.cudaStreamSynchronize(stream)
```
CUDAGraphs also requires a degree of control over the CUDA stream actively in use within the user context. This includes copying input tensors to precise memory locations and other considerations for managed memory. Approach 2 above alleviates this a bit by relying on the TRT Module to do at least a portion of stream-management. Some of the methods in Approach 1 handle stream logic behind the scenes as well.
Data Structures
The CUDAGraphs will require a storage mechanism. This will likely depend on the selected ir to an extent. The graph object itself will be stored as a field of the selected runtime module. At first, there can be a single CUDAGraph, compiled for the first shape which the model encounters. Later, extensions can be made to construct a dictionary, keyed on input shapes or dynamic dimensions, to support saving multiple CUDAGraphs.
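A possible shape for such a structure is sketched below. `GraphRecord` and `CudaGraphCache` are hypothetical names invented for this illustration, not existing Torch-TRT classes; the key is the tuple of input shapes, matching the dictionary idea above:

```python
# Sketch of a per-module cache of captured CUDAGraphs, keyed on input shapes.
# Field types are intentionally loose (Any) since the concrete graph and
# buffer types depend on the chosen recording approach.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

ShapeKey = Tuple[Tuple[int, ...], ...]  # one shape tuple per input tensor

@dataclass
class GraphRecord:
    graph: Any            # e.g. a torch.cuda.CUDAGraph
    static_inputs: Any    # captured input buffers to copy into
    static_outputs: Any   # captured output buffers to read from

@dataclass
class CudaGraphCache:
    records: Dict[ShapeKey, GraphRecord] = field(default_factory=dict)

    def lookup(self, shapes: ShapeKey) -> Optional[GraphRecord]:
        # None signals "no graph recorded for these shapes": record or fall back.
        return self.records.get(shapes)

    def store(self, shapes: ShapeKey, record: GraphRecord) -> None:
        self.records[shapes] = record
```

The single-graph starting point described above is just this cache with one entry, keyed on the first shape the model encounters.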
ir="dynamo"
When using ir="dynamo", we do not have the capability to recompile the model or re-record the CUDAGraph at runtime with a different dynamic shape. Therefore, we either need to store multiple CUDAGraphs corresponding to common shapes, or a single CUDAGraph to which other shapes are "paddable".
To start, we can have a CUDAGraph be generated for the opt shape as a default case, and then expand to other more advanced cases later.
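A rough sketch of what padding to the opt shape could look like. Zero-filling into the leading region is an assumed padding policy (not specified by this proposal), and a real implementation would also need to slice outputs back down to the true shape:

```python
# Hypothetical padding helper for the ir="dynamo" path: pad an input up to the
# opt shape the CUDAGraph was recorded for, so one graph can serve smaller
# inputs. Zero-fill semantics are an assumption for illustration.
import torch

def pad_to_opt(x: torch.Tensor, opt_shape) -> torch.Tensor:
    if tuple(x.shape) == tuple(opt_shape):
        return x
    padded = x.new_zeros(tuple(opt_shape))
    # Copy the real data into the leading region of the padded buffer.
    padded[tuple(slice(0, d) for d in x.shape)] = x
    return padded
```

Whether padding is numerically safe depends on the model (e.g. normalization layers may mix padded rows into real outputs), which is part of why user involvement is assumed in the Limitations section.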
ir="torch_compile"
In the torch.compile case, the capability for recompilation unlocks additional applications for CUDAGraphs. In much the same way that mode="reduce-overhead" provides CUDAGraphs capabilities for Inductor, along with dynamic recompilation on shape changes, we can use this model to support CUDAGraphs with dynamic shapes.
Still, to start it would be easiest to generate the CUDAGraph as part of compilation and store only one graph per engine.
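For comparison, the existing Inductor behavior this would mirror can be enabled today via mode="reduce-overhead". Note that a `cudagraphs` flag for the Torch-TRT backend is the proposal in this document, not an existing option, so only the Inductor form is shown:

```python
# Existing PyTorch API: torch.compile with mode="reduce-overhead" enables
# Inductor's CUDAGraph path, with recompilation guarded on shape changes.
# The proposed Torch-TRT flag would be the analogue for TRT-compiled graphs.
import torch

model = torch.nn.Linear(8, 8)
compiled = torch.compile(model, mode="reduce-overhead")
```

Compilation (and graph capture) is deferred until the first call, which is also when Inductor installs the shape guards that drive recompilation.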
Details specific for TorchScript Support
TorchScript support should follow if Approach 2 is taken, since the runtime modifications will be portable to TorchScript if they are functional for ir="dynamo".
Implementation Phases
Prototype - Small/Medium
- Attempt wrapping the compilation call with one of the higher-level APIs mentioned in Approach 1. Determine if any speedup is noticed on key models.
- Integrate a draft version of Approach 2 into the Python runtime only.
- Use only static-shape cases; assume no serialization (for now).
MVP (2.4.0) - Medium/Large
- Integrate Approach 2 into the C++ runtime.
- Add support for serialization with CUDAGraphs.
- (Optional) Add support for dynamic shape cases.
Extension Phase 1 - Medium
Add functionality similar to mode="reduce-overhead", with dynamic shape support for torch.compile and smart recompilation leveraging PyTorch guards. Consider integrating with or using the Inductor functionality directly.