[PERF][BUG]: thrust::transform does not saturate bandwidth on newer hardware architectures (down to 62% SoL on H200 for int) #1673

Open
1 task done
ahendriksen opened this issue Apr 29, 2024 · 0 comments
ahendriksen commented Apr 29, 2024

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

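Using thrust::transform on newer hardware architectures can leave memory bandwidth well below the speed of light (SoL). On an H200, the int32 mul benchmark below reaches only 62.4% of SoL, and the int8 benchmarks reach between 21% and 35%.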

How to Reproduce

See the Godbolt link under "Reproduction link" for the exact reproducer.

Output:

benchmark   type             cp_gb  elapsed_ms     bw_gbps  pct_of_sol
mul         int8              8.59      8.2682      1038.9       21.1%
add         int8             12.88      9.2302      1396.0       28.4%
triad       int8             12.88      9.2364      1395.0       28.4%
nstream     int8             17.18      9.9301      1730.1       35.2%
mul         int16             8.59      4.6822      1834.6       37.3%
add         int16            12.88      5.2990      2431.6       49.5%
triad       int16            12.88      5.3074      2427.7       49.4%
nstream     int16            17.18      5.9318      2896.2       58.9%
mul         int32             8.59      2.8014      3066.3       62.4%
add         int32            12.88      3.4908      3691.1       75.1%
triad       int32            12.88      3.4901      3691.8       75.1%
nstream     int32            17.18      4.2756      4018.1       81.7%
mul         int64             8.59      2.1956      3912.3       79.6%
add         int64            12.88      2.9556      4359.5       88.7%
triad       int64            12.88      2.9548      4360.7       88.7%
nstream     int64            17.18      3.9255      4376.5       89.0%
mul         int128           17.18      4.0780      4212.9       85.7%
add         int128           25.77      5.9694      4317.0       87.8%
triad       int128           25.77      5.9789      4310.1       87.7%
nstream     int128           34.36      7.8597      4371.6       88.9%
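For readers checking the table, the bw_gbps and pct_of_sol columns follow from cp_gb and elapsed_ms by simple arithmetic. The sketch below is not taken from the reproducer: the function names are hypothetical, and the peak bandwidth is not stated in this issue, so the value of roughly 4916 GB/s is an assumption back-computed from the rows above.

```cpp
#include <cassert>
#include <cstdio>

// Achieved bandwidth in GB/s, given the data moved in GB (1e9 bytes)
// and the elapsed time in milliseconds.
double bandwidth_gbps(double copied_gb, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return copied_gb / (elapsed_ms * 1e-3);
}

// Percentage of speed of light (SoL) relative to a peak bandwidth in GB/s.
double pct_of_sol(double bw_gbps, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return 100.0 * bw_gbps / peak_gbps;
}

// ASSUMPTION: the issue does not state the peak bandwidth. This value is
// back-computed from the table (bw_gbps / pct_of_sol) and is approximate.
constexpr double kAssumedPeakGbps = 4916.0;

// Prints one row in a layout similar to the table above (hypothetical helper).
void print_row(const char* benchmark, const char* type,
               double copied_gb, double elapsed_ms) {
    const double bw = bandwidth_gbps(copied_gb, elapsed_ms);
    std::printf("%-11s %-10s %8.2f %11.4f %11.1f %10.1f%%\n",
                benchmark, type, copied_gb, elapsed_ms, bw,
                pct_of_sol(bw, kAssumedPeakGbps));
}
```

Calling `print_row("mul", "int32", 8.59, 2.8014)` should give roughly 3066 GB/s and about 62.4% of SoL, matching the int32 mul row up to rounding of the inputs.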

Expected behavior

The int32 benchmarks should saturate bandwidth (~90% of SoL). The int16 and int8 benchmarks should reach reasonable performance (>60% of SoL). The int64 mul benchmark should reach 90% of SoL.

The int128 benchmarks and the remaining int64 benchmarks are included as a reference; their performance is acceptable.
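The targets above can be written down as a small check against the pct_of_sol column. This is only a sketch: the function names are hypothetical, "~90%" is treated as a hard 90.0 threshold and ">60%" as 60.0, and the reference rows (int128 and the remaining int64 benchmarks) are given no target because none is stated.

```cpp
#include <cassert>
#include <string>

// Minimum pct_of_sol a row should reach, following the targets stated above.
// ASSUMPTION: "~90%" and ">60%" are interpreted as thresholds of 90.0 and 60.0.
double expected_min_pct(const std::string& benchmark, int type_bits) {
    if (type_bits <= 16) return 60.0;                        // int8, int16
    if (type_bits == 32) return 90.0;                        // int32: saturate bandwidth
    if (type_bits == 64 && benchmark == "mul") return 90.0;  // int64 mul
    return 0.0;                                              // reference rows: no target
}

// True if the measured percentage meets the target for that row.
bool meets_expectation(const std::string& benchmark, int type_bits,
                       double measured_pct) {
    assert(measured_pct >= 0.0);
    return measured_pct >= expected_min_pct(benchmark, type_bits);
}
```

With the numbers from the table, `meets_expectation("mul", 32, 62.4)` and `meets_expectation("mul", 64, 79.6)` are false, while `meets_expectation("add", 64, 88.7)` is true because that row is a reference without a target.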

Reproduction link

https://godbolt.org/z/K7EW5freK

Operating System

No response

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:45:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

NA

@ahendriksen ahendriksen added the bug Something isn't working right. label Apr 29, 2024
@bernhardmgruber bernhardmgruber self-assigned this Apr 29, 2024