CUDAGraphs in Torch-TRT
TL;DR
Enable CUDAGraphs model acceleration in Torch-TRT to enhance performance by hiding kernel launch time bottlenecks. CUDAGraphs support is enabled via a boolean compile argument. Some considerations remain regarding dynamic shapes.
Goal(s)
Boost performance via the use of CUDAGraphs to group kernels into units, thereby improving kernel launch times and decreasing overhead.
Use Cases
Proposed APIs / UX
The API would be invoked via an argument to torch_tensorrt.compile, as follows:
torch_tensorrt.compile(..., cudagraphs=True, ...)
Then, the backend would attempt to compile the graph with the specified input shapes using CUDAGraphs. This argument provides a layer of abstraction to the user: the background work of CUDAGraphs is handled by Torch-TensorRT, and the acceleration benefits are immediate.
Limitations
This feature will not work with arbitrary dynamic-shape inputs without user-provided padding, due to memory and recompilation limitations. Some consideration is needed to define a mechanism for handling dynamic-shape models, including when to dispatch to the CUDAGraphs implementation versus falling back to eager execution.
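The dispatch decision described above could be sketched as follows. This is a minimal illustration, not the proposed implementation; `recorded_shapes`, `run_cudagraph`, and `run_eager` are hypothetical names introduced here for clarity:

```python
# Sketch: replay the recorded CUDAGraph only on an exact shape match, since a
# new dynamic shape would otherwise read/write the wrong static buffers.
# All names below are illustrative assumptions, not Torch-TRT API.
from typing import Callable, Sequence, Tuple

def dispatch(input_shapes: Sequence[Tuple[int, ...]],
             recorded_shapes: Sequence[Tuple[int, ...]],
             run_cudagraph: Callable[[], object],
             run_eager: Callable[[], object]) -> object:
    if list(input_shapes) == list(recorded_shapes):
        return run_cudagraph()
    # Fallback path: execute eagerly for shapes the graph was not recorded for.
    return run_eager()
```

A real implementation would also need to decide whether a mismatched shape should trigger re-recording rather than a permanent eager fallback.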
Internal Implementation
Design
The key requirements of this feature are an instantiation, building, and storage mechanism for the CUDAGraphs objects.
Extensions Required to Core API implementations
The main APIs needing to change are the TRT runtime modules. There are a few candidate options for modifications:
1. All Recording in Python
Recording graph operations can be done after TRT engine building, as a wrapper over the inference execution, using one of several higher-level Python recording APIs.
One detail that is still unclear about this approach is whether recording the completed TRT engine will capture kernel artifacts that are undesired in the end result.
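One way Approach 1 could look, sketched with PyTorch's public torch.cuda.CUDAGraph API. This assumes a CUDA device and a compiled `trt_module` with a single fixed-shape input; the function name, signature, and warmup policy are illustrative assumptions, not the proposed design:

```python
# Approach-1 sketch: record the compiled module's forward pass as a CUDAGraph
# and replay it on later calls. Assumes CUDA is available and input shapes
# are static; `trt_module` is a stand-in for the compiled TRT runtime module.
import torch

def record_wrapper(trt_module, sample_input: torch.Tensor):
    static_input = sample_input.clone()

    # Warm up on a side stream before capture, as PyTorch recommends.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        trt_module(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into the graph; outputs alias static buffers.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = trt_module(static_input)

    def run(new_input: torch.Tensor) -> torch.Tensor:
        # Copy into the captured input buffer, then replay the recorded kernels.
        static_input.copy_(new_input)
        graph.replay()
        return static_output

    return run
```

Because replay reuses the captured buffers, callers must copy inputs in (and, if needed, clone outputs out) rather than passing fresh tensors, which is exactly the stream/memory-management concern noted below.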
2. Recording at Engine Build Time
As recommended in the TensorRT developer guide (linked below), the execute_async call can be recorded to capture the sequence of kernels run in a narrow and precise scope. This is likely the better approach, though it would require more effort, since it needs C++ modifications for the TorchTensorRTModule and Python modifications for the PythonTorchTensorRTModule.
Snippet adapted from https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-graphs

```python
from cuda import cudart

err, stream = cudart.cudaStreamCreate()

# Call execute_async_v3() once after an input shape change to update internal state.
context.execute_async_v3(stream)

# Capture a CUDA graph instance
cudart.cudaStreamBeginCapture(stream, cudart.cudaStreamCaptureModeGlobal)
context.execute_async_v3(stream)
err, graph = cudart.cudaStreamEndCapture(stream)
err, instance = cudart.cudaGraphInstantiate(graph, 0)

# To run inference, launch the graph instead of calling execute_async_v3().
for i in range(iterations):
    cudart.cudaGraphLaunch(instance, stream)
    cudart.cudaStreamSynchronize(stream)
```
CUDAGraphs also requires a degree of control over the CUDA stream actively in use within the user context. This includes copying input tensors to precise memory locations and other considerations for managed memory. Approach 2 above alleviates this a bit by relying on the TRT Module to do at least a portion of stream-management. Some of the methods in Approach 1 handle stream logic behind the scenes as well.
Data Structures
The CUDAGraphs will require a storage mechanism. This will likely depend on the selected ir to an extent. The graph object itself will be stored as a field of the selected runtime module. At first, there can be a single CUDAGraph, compiled for the first shape which the model encounters. Later, extensions can be made to construct a dictionary, keyed on input shapes or dynamic dimensions, to support saving multiple CUDAGraphs.
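A possible shape for such a structure is sketched below. `GraphRecord` and `CudaGraphCache` are hypothetical names invented for this illustration, not existing Torch-TRT classes; the key is the tuple of input shapes, matching the dictionary idea above:

```python
# Sketch of a per-module cache of captured CUDAGraphs, keyed on input shapes.
# Field types are intentionally loose (Any) since the concrete graph and
# buffer types depend on the chosen recording approach.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

ShapeKey = Tuple[Tuple[int, ...], ...]  # one shape tuple per input tensor

@dataclass
class GraphRecord:
    graph: Any            # e.g. a torch.cuda.CUDAGraph
    static_inputs: Any    # captured input buffers to copy into
    static_outputs: Any   # captured output buffers to read from

@dataclass
class CudaGraphCache:
    records: Dict[ShapeKey, GraphRecord] = field(default_factory=dict)

    def lookup(self, shapes: ShapeKey) -> Optional[GraphRecord]:
        # None signals "no graph recorded for these shapes": record or fall back.
        return self.records.get(shapes)

    def store(self, shapes: ShapeKey, record: GraphRecord) -> None:
        self.records[shapes] = record
```

The single-graph starting point described above is just this cache with one entry, keyed on the first shape the model encounters.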
ir="dynamo"
When using ir="dynamo", we do not have the capability to recompile the model or re-record the CUDAGraph at runtime with a different dynamic shape. Therefore, we either need to store multiple CUDAGraphs corresponding to common shapes, or a single CUDAGraph to which other shapes are "paddable".
To start, we can have a CUDAGraph be generated for the opt shape as a default case, and then expand to other more advanced cases later.
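A rough sketch of what padding to the opt shape could look like. Zero-filling into the leading region is an assumed padding policy (not specified by this proposal), and a real implementation would also need to slice outputs back down to the true shape:

```python
# Hypothetical padding helper for the ir="dynamo" path: pad an input up to the
# opt shape the CUDAGraph was recorded for, so one graph can serve smaller
# inputs. Zero-fill semantics are an assumption for illustration.
import torch

def pad_to_opt(x: torch.Tensor, opt_shape) -> torch.Tensor:
    if tuple(x.shape) == tuple(opt_shape):
        return x
    padded = x.new_zeros(tuple(opt_shape))
    # Copy the real data into the leading region of the padded buffer.
    padded[tuple(slice(0, d) for d in x.shape)] = x
    return padded
```

Whether padding is numerically safe depends on the model (e.g. normalization layers may mix padded rows into real outputs), which is part of why user involvement is assumed in the Limitations section.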
ir="torch_compile"
In the torch.compile case, the capability for recompilation unlocks additional applications for CUDAGraphs. In much the same way that mode="reduce-overhead" provides CUDAGraphs capabilities for Inductor, along with dynamic recompilation on shape changes, we can use this model to support CUDAGraphs with dynamic shapes.
Still, to start it would be easiest to generate the CUDAGraph as part of compilation and store only one graph per engine.
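For comparison, the existing Inductor behavior this would mirror can be enabled today via mode="reduce-overhead". Note that a `cudagraphs` flag for the Torch-TRT backend is the proposal in this document, not an existing option, so only the Inductor form is shown:

```python
# Existing PyTorch API: torch.compile with mode="reduce-overhead" enables
# Inductor's CUDAGraph path, with recompilation guarded on shape changes.
# The proposed Torch-TRT flag would be the analogue for TRT-compiled graphs.
import torch

model = torch.nn.Linear(8, 8)
compiled = torch.compile(model, mode="reduce-overhead")
```

Compilation (and graph capture) is deferred until the first call, which is also when Inductor installs the shape guards that drive recompilation.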
Details specific for TorchScript Support
TorchScript support should follow if Approach 2 is taken, since the runtime modifications will be portable to TorchScript if they are functional for ir="dynamo".
Implementation Phases
Prototype - Small/Medium
- Attempt wrapping the compilation call with one of the higher-level APIs mentioned in Approach 1. Determine if any speedup is noticed on key models.
- Integrate a draft version of Approach 2 into the Python runtime only.
- Use only static-shape cases; assume no serialization (for now).
MVP (2.4.0) - Medium/Large
- Integrate Approach 2 into the C++ runtime.
- Add support for serialization with CUDAGraphs.
- (Optional) Add support for dynamic shape cases.
Extension Phase 1 - Medium
Add functionality similar to mode="reduce-overhead", with dynamic shape support for torch.compile and smart recompilation leveraging PyTorch guards. Consider integrating with or using the Inductor functionality directly.