
Benchmarking inference of TensorRT 8.6.3 using trtexec on GPU RTX 4090 #3857

Open
bernardrb opened this issue May 11, 2024 · 3 comments
Assignees: zerollzeng
Labels: triaged (Issue has been triaged by maintainers)

@bernardrb

Description

We're benchmarking our mixed-precision models using:

trtexec --loadEngine=model.engine --useCudaGraph --iterations=100 --avgRuns=100

We compared two models: a baseline entirely in FP16, and a variant where we reduced the precision of the first "stage" of the model and left the rest in FP16. However, we don't know if we can trust the results.
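
For context, the kind of build we mean looks roughly like this (the ONNX path, the layer name, and the choice of INT8 for the reduced-precision stage are placeholders, not our exact build command):

# pin one (hypothetical) first-stage layer to INT8 and default the remaining layers to FP16;
# INT8 normally also needs calibration data or Q/DQ nodes in the ONNX model
trtexec --onnx=model.onnx --fp16 --int8 --precisionConstraints=obey --layerPrecisions="stage1/conv1:int8,*:fp16" --saveEngine=model.engine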

For the baseline model (layer info in the attached inspect.txt), the performance summary was:

[05/11/2024-10:13:11] [I] === Trace details ===
[05/11/2024-10:13:11] [I] Trace averages of 100 runs:
[05/11/2024-10:13:11] [I] Average on 100 runs - GPU latency: 13.8122 ms - Host latency: 14.458 ms (enqueue 0.0187917 ms)
[05/11/2024-10:13:11] [I] Average on 100 runs - GPU latency: 13.8192 ms - Host latency: 14.4658 ms (enqueue 0.0200439 ms)
[05/11/2024-10:13:11] [I] 
[05/11/2024-10:13:11] [I] === Performance summary ===
[05/11/2024-10:13:11] [I] Throughput: 72.047 qps
[05/11/2024-10:13:11] [I] Latency: min = 14.4292 ms, max = 14.6636 ms, mean = 14.462 ms, median = 14.4604 ms, percentile(90%) = 14.4785 ms, percentile(95%) = 14.4811 ms, percentile(99%) = 14.4929 ms
[05/11/2024-10:13:11] [I] Enqueue Time: min = 0.00354004 ms, max = 0.0480042 ms, mean = 0.0194845 ms, median = 0.0201111 ms, percentile(90%) = 0.0214844 ms, percentile(95%) = 0.0220337 ms, percentile(99%) = 0.0231934 ms
[05/11/2024-10:13:11] [I] H2D Latency: min = 0.473633 ms, max = 0.507935 ms, mean = 0.486256 ms, median = 0.486084 ms, percentile(90%) = 0.493652 ms, percentile(95%) = 0.495605 ms, percentile(99%) = 0.503204 ms
[05/11/2024-10:13:11] [I] GPU Compute Time: min = 13.79 ms, max = 14.0186 ms, mean = 13.8158 ms, median = 13.8147 ms, percentile(90%) = 13.8293 ms, percentile(95%) = 13.8333 ms, percentile(99%) = 13.8413 ms
[05/11/2024-10:13:11] [I] D2H Latency: min = 0.158203 ms, max = 0.163086 ms, mean = 0.159976 ms, median = 0.159546 ms, percentile(90%) = 0.162109 ms, percentile(95%) = 0.162598 ms, percentile(99%) = 0.163025 ms
[05/11/2024-10:13:11] [I] Total Host Walltime: 3.03968 s
[05/11/2024-10:13:11] [I] Total GPU Compute Time: 3.02566 s

We then compared against the reduced-precision model (layer info in the attached inspect.txt); its performance summary was:

[05/11/2024-10:10:49] [I] === Trace details ===
[05/11/2024-10:10:49] [I] Trace averages of 100 runs:
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.75399 ms - Host latency: 4.38808 ms (enqueue 0.00765167 ms)
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.69356 ms - Host latency: 4.32738 ms (enqueue 0.00752808 ms)
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.77367 ms - Host latency: 4.40745 ms (enqueue 0.00756958 ms)
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.75653 ms - Host latency: 4.39059 ms (enqueue 0.00737061 ms)
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.72422 ms - Host latency: 4.3582 ms (enqueue 0.00747681 ms)
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.78557 ms - Host latency: 4.41997 ms (enqueue 0.00754883 ms)
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.73969 ms - Host latency: 4.37361 ms (enqueue 0.00767334 ms)
[05/11/2024-10:10:49] [I] Average on 100 runs - GPU latency: 3.7509 ms - Host latency: 4.38498 ms (enqueue 0.00746826 ms)
[05/11/2024-10:10:49] [I] 
[05/11/2024-10:10:49] [I] === Performance summary ===
[05/11/2024-10:10:49] [I] Throughput: 266.393 qps
[05/11/2024-10:10:49] [I] Latency: min = 4.30334 ms, max = 4.58621 ms, mean = 4.38137 ms, median = 4.38037 ms, percentile(90%) = 4.42969 ms, percentile(95%) = 4.43506 ms, percentile(99%) = 4.58322 ms
[05/11/2024-10:10:49] [I] Enqueue Time: min = 0.00561523 ms, max = 0.017334 ms, mean = 0.00753628 ms, median = 0.00756836 ms, percentile(90%) = 0.00793457 ms, percentile(95%) = 0.00805664 ms, percentile(99%) = 0.0090332 ms
[05/11/2024-10:10:49] [I] H2D Latency: min = 0.472656 ms, max = 0.481323 ms, mean = 0.474178 ms, median = 0.474121 ms, percentile(90%) = 0.475342 ms, percentile(95%) = 0.475586 ms, percentile(99%) = 0.476074 ms
[05/11/2024-10:10:49] [I] GPU Compute Time: min = 3.66797 ms, max = 3.95267 ms, mean = 3.74735 ms, median = 3.74585 ms, percentile(90%) = 3.7959 ms, percentile(95%) = 3.80103 ms, percentile(99%) = 3.94754 ms
[05/11/2024-10:10:49] [I] D2H Latency: min = 0.158325 ms, max = 0.162842 ms, mean = 0.15984 ms, median = 0.159912 ms, percentile(90%) = 0.160461 ms, percentile(95%) = 0.160645 ms, percentile(99%) = 0.160889 ms
[05/11/2024-10:10:49] [I] Total Host Walltime: 3.01059 s
[05/11/2024-10:10:49] [I] Total GPU Compute Time: 3.00537 s

While the results are promising, we are puzzled that reducing the precision of a single stage leads to such a large latency improvement. Can we trust these results?
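
For what it's worth, as a rough sanity check the two summaries look internally consistent: with --useCudaGraph the enqueue time is negligible, so throughput should be roughly the inverse of the mean GPU Compute Time.

baseline: 1 / 13.8158 ms ≈ 72.4 qps (reported 72.047 qps)
reduced precision: 1 / 3.74735 ms ≈ 266.9 qps (reported 266.393 qps)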

Environment

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 40%   32C    P8              5W /  450W |      11MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

TensorRT Version: 8.6.3

Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:24.02-py3

Relevant Files

Model link (both models included):

https://drive.google.com/drive/folders/1MJAP7NDO7zzRJlUJFexpTcxKVWT9tnuP?usp=sharing

@zerollzeng (Collaborator)

  1. Did you fix the GPU clock?
  2. Can you make sure that no other task is running while doing the perf test?
  3. You can try adding --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --useCudaGraph --noDataTransfers --useSpinWait so you will see the per-layer profile; then you can check whether the result is expected (example commands below).
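
A minimal sketch of such a run (the clock value is only an example pair; query the supported clocks for your GPU first, and locking clocks requires root):

# list the clock values the GPU supports
nvidia-smi -q -d SUPPORTED_CLOCKS
# enable persistence mode and lock the GPU clock to a fixed value (2520,2520 is a placeholder pair)
sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 2520,2520
# profiling run with the flags listed above
trtexec --loadEngine=model.engine --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --useCudaGraph --noDataTransfers --useSpinWait
# reset the GPU clock afterwards
sudo nvidia-smi -rgc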

@zerollzeng zerollzeng self-assigned this May 17, 2024
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label May 17, 2024
@bernardrb (Author)

(1 & 2) We didn't fix the GPU clock, and we have only one other process running on the GPU:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
  1. Are you recommending that we compare the per-layer averageMs to the GPU Compute Time?
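
If so, one rough way we could do the comparison (assuming the exported per-layer profile JSON has an averageMs field per layer; the exact schema may differ between trtexec versions):

# profiling run that also writes the per-layer profile to JSON
trtexec --loadEngine=model.engine --useCudaGraph --noDataTransfers --useSpinWait --separateProfileRun --dumpProfile --exportProfile=profile.json
# sum the per-layer averages and compare the total against the reported mean GPU Compute Time
jq '[.[] | .averageMs? // 0] | add' profile.json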

@bernardrb (Author)

@johnyang-nv

I saw your issue "About TensorRT Latency Measure" and thought you might have some insight on this.
