
How To Benchmark Torch‐TensorRT with TorchBench


Benchmarking Torch-TRT

We have added support for benchmarking Torch-TRT across IRs (torchscript, torch_compile, dynamo) in TorchBench, which features a set of key models and makes it easy to add more.

Setup

First, it is key to set up a clean environment for benchmarking. We have two recommended ways to accomplish this.

  1. Set up a container based on the provided TorchBench Dockerfiles, then install torch_tensorrt in it (a sketch of this option follows the list).
  2. Set up a container based on the Torch-TRT Dockerfile, then install TorchBench in it.
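As a minimal sketch of option 1, assuming the TorchBench repository layout with a Dockerfile under its docker/ directory (the exact file name and whether the model install step must be re-run inside the container are assumptions):

# Build and start a container from a TorchBench Dockerfile (file name is an assumption; check docker/ in the repo)
git clone https://github.com/pytorch/benchmark
cd benchmark
docker build -t torchbench -f docker/torchbench-nightly.dockerfile .
docker run --gpus all -it torchbench bash
# Inside the container: install Torch-TensorRT and, if needed, the TorchBench model dependencies
pip install torch-tensorrt
python install.py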

With the environment set up, benchmarking Torch-TRT in TorchBench can be done in the following ways (from the root of the TorchBench clone).

General Usage

# Prints metrics to stdout
python run.py {MODEL selected from TorchBench set} -d cuda -t eval --backend torch_trt --precision [fp32 OR fp16] [Torch-TRT specific options, see below]
# Saves metrics to .userbenchmark/torch_trt/metrics-*.json
python run_benchmark.py torch_trt --model {MODEL selected from TorchBench set} --precision [fp32 OR fp16] [Torch-TRT specific options, see below]
Torch-TRT specific options:

--truncate_long_and_double: Whether to automatically truncate long and double (64-bit) types to their 32-bit equivalents
--min_block_size: Minimum number of operations in an accelerated TRT block
--workspace_size: Size of the workspace allotted to TensorRT, in bytes
--ir: Which internal representation to use: {"ts", "torch_compile", "dynamo", ...}
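The Torch-TRT specific options can also be combined in a single invocation; the flag values below are purely illustrative, not recommendations:

# Benchmarks ResNet18 via the dynamo path with all Torch-TRT specific options set (values are illustrative)
python run.py resnet18 -d cuda -t eval --backend torch_trt --precision fp16 --ir dynamo --truncate_long_and_double --min_block_size 3 --workspace_size 1073741824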

Benchmarking a Single Model, Printing Metrics Directly

# Benchmarks ResNet18 with Torch-TRT, using FP32 precision, truncate_long_and_double=True, and compiling via the TorchScript path
python run.py resnet18 -d cuda -t eval --backend torch_trt --precision fp32 --truncate_long_and_double --ir torchscript
# Benchmarks VGG16 with Torch-TRT, using FP16 precision, Batch Size 32, and compiling via the dynamo path
python run.py vgg16 -d cuda -t eval --backend torch_trt --precision fp16 --ir dynamo --bs 32
# Benchmarks BERT with Torch-TRT, using FP16 precision, truncate_long_and_double=True, and compiling via the torch compile path
python run.py BERT_pytorch -d cuda -t eval --backend torch_trt --precision fp16 --truncate_long_and_double --ir torch_compile

Benchmarking a Single Model, Saving Metrics to File

In both of the cases below, the metrics will be saved to files at the path .userbenchmark/torch_trt/metrics-*.json, following the TorchBench userbenchmark convention.

# Benchmarks ResNet18 with Torch-TRT, using FP32 precision, truncate_long_and_double=True, and compiling via the TorchScript path
python run_benchmark.py torch_trt --model resnet18 --precision fp32 --truncate_long_and_double --ir torchscript
# Benchmarks VGG16 with Torch-TRT, using FP16 precision, Batch Size 32, and compiling via the dynamo path
python run_benchmark.py torch_trt --model vgg16 --precision fp16 --ir dynamo --bs 32
# Benchmarks BERT with Torch-TRT, using FP16 precision, truncate_long_and_double=True, and compiling via the torch compile path
python run_benchmark.py torch_trt --model BERT_pytorch --precision fp16 --truncate_long_and_double --ir torch_compile
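To quickly inspect the saved results, the metrics files can be pretty-printed from the shell; this assumes only the output location shown above and uses Python's built-in json.tool module:

# Pretty-print each saved metrics file under the output directory noted above
for f in .userbenchmark/torch_trt/metrics-*.json; do echo "== $f"; python -m json.tool "$f"; done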

Benchmarking Many Models, Saving Metrics to File

In the future, we hope to enable:

# Benchmarks all TorchBench models with Torch-TRT, compiling via the torch compile path
python run_benchmark.py torch_trt --precision fp16 --ir torch_compile

Currently, this is still in development, and the recommended method to benchmark multiple models is to write a bash script that iterates over the set of desired models and runs the individual benchmark for each, as sketched below. See the discussion here for more details.
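A minimal sketch of such a script, reusing the commands from the examples above (the model list and options are illustrative):

#!/bin/bash
# Benchmark a set of TorchBench models with Torch-TRT, one run_benchmark.py invocation per model
MODELS="resnet18 vgg16 BERT_pytorch"
for model in $MODELS; do
    python run_benchmark.py torch_trt --model "$model" --precision fp16 --ir torch_compile
done
# Each run writes its metrics to .userbenchmark/torch_trt/metrics-*.json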