
How To Benchmark Torch‐TensorRT with TorchBench


Benchmarking Torch-TRT

We have added support for benchmarking Torch-TRT across IRs (torchscript, torch_compile, dynamo) in TorchBench, which features a set of key models and makes it easy to add more.

Setup

First, it is key to set up a clean environment for benchmarking. We have two recommended ways to accomplish this.

  1. Set up a container based on the provided TorchBench Dockerfiles, then install torch_tensorrt in it (a sketch of this option follows the list).
  2. Set up a container based on the Torch-TRT Dockerfile, then install TorchBench in it.
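As a minimal sketch of option 1, assuming the TorchBench repository layout with a Dockerfile under its docker/ directory (the exact file name and whether the model install step must be re-run inside the container are assumptions):

# Build and start a container from a TorchBench Dockerfile (file name is an assumption; check docker/ in the repo)
git clone https://github.com/pytorch/benchmark
cd benchmark
docker build -t torchbench -f docker/torchbench-nightly.dockerfile .
docker run --gpus all -it torchbench bash
# Inside the container: install Torch-TensorRT and, if needed, the TorchBench model dependencies
pip install torch-tensorrt
python install.py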

With the environment set up, benchmarking Torch-TRT in TorchBench can be done in the following ways (from the root of the TorchBench clone).

General Usage

# Prints metrics to stdout
python run.py {MODEL selected from TorchBench set} -d cuda -t eval --backend torch_trt --precision [fp32 OR fp16] [Torch-TRT specific options, see below]
# Saves metrics to .userbenchmark/torch_trt/metrics-*.json
python run_benchmark.py torch_trt --model {MODEL selected from TorchBench set} --precision [fp32 OR fp16] [Torch-TRT specific options, see below]
Torch-TRT specific options:

--truncate_long_and_double: Whether to automatically truncate long and double (64-bit) types to their 32-bit equivalents
--min_block_size: Minimum number of operations in an accelerated TRT block
--workspace_size: Size of the workspace allotted to TensorRT, in bytes
--ir: Which internal representation to use: {"ts", "torch_compile", "dynamo", ...}
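The Torch-TRT specific options can also be combined in a single invocation; the flag values below are purely illustrative, not recommendations:

# Benchmarks ResNet18 via the dynamo path with all Torch-TRT specific options set (values are illustrative)
python run.py resnet18 -d cuda -t eval --backend torch_trt --precision fp16 --ir dynamo --truncate_long_and_double --min_block_size 3 --workspace_size 1073741824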

Benchmarking a Single Model, Printing Metrics Directly

# Benchmarks ResNet18 with Torch-TRT, using FP32 precision, truncate_long_and_double=True, and compiling via the TorchScript path
python run.py resnet18 -d cuda -t eval --backend torch_trt --precision fp32 --truncate_long_and_double --ir torchscript
# Benchmarks VGG16 with Torch-TRT, using FP16 precision, Batch Size 32, and compiling via the dynamo path
python run.py vgg16 -d cuda -t eval --backend torch_trt --precision fp16 --ir dynamo --bs 32
# Benchmarks BERT with Torch-TRT, using FP16 precision, truncate_long_and_double=True, and compiling via the torch compile path
python run.py BERT_pytorch -d cuda -t eval --backend torch_trt --precision fp16 --truncate_long_and_double --ir torch_compile

Benchmarking a Single Model, Saving Metrics to File

In both of the cases below, the metrics will be saved to files at the path .userbenchmark/torch_trt/metrics-*.json, following the TorchBench userbenchmark convention.

# Benchmarks ResNet18 with Torch-TRT, using FP32 precision, truncate_long_and_double=True, and compiling via the TorchScript path
python run_benchmark.py torch_trt --model resnet18 --precision fp32 --truncate_long_and_double --ir torchscript
# Benchmarks VGG16 with Torch-TRT, using FP16 precision, Batch Size 32, and compiling via the dynamo path
python run_benchmark.py torch_trt --model vgg16 --precision fp16 --ir dynamo --bs 32
# Benchmarks BERT with Torch-TRT, using FP16 precision, truncate_long_and_double=True, and compiling via the torch compile path
python run_benchmark.py torch_trt --model BERT_pytorch --precision fp16 --truncate_long_and_double --ir torch_compile
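To quickly inspect the saved results, the metrics files can be pretty-printed from the shell; this assumes only the output location shown above and uses Python's built-in json.tool module:

# Pretty-print each saved metrics file under the output directory noted above
for f in .userbenchmark/torch_trt/metrics-*.json; do echo "== $f"; python -m json.tool "$f"; done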

Benchmarking Many Models, Saving Metrics to File

In the future, we hope to enable:

# Benchmarks all TorchBench models with Torch-TRT, compiling via the torch compile path
python run_benchmark.py torch_trt --precision fp16 --ir torch_compile

Currently, this is still in development, and the recommended method to benchmark multiple models is to write a bash script that iterates over the set of desired models and runs the individual benchmark for each, as sketched below. See the discussion here for more details.
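A minimal sketch of such a script, reusing the commands from the examples above (the model list and options are illustrative):

#!/bin/bash
# Benchmark a set of TorchBench models with Torch-TRT, one run_benchmark.py invocation per model
MODELS="resnet18 vgg16 BERT_pytorch"
for model in $MODELS; do
    python run_benchmark.py torch_trt --model "$model" --precision fp16 --ir torch_compile
done
# Each run writes its metrics to .userbenchmark/torch_trt/metrics-*.json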