Only ~783 GB/s out of theoretical 900 GB/s HGX H100 SXM NVLink4 #1264

Open
OrenLeung opened this issue Apr 24, 2024 · 2 comments

OrenLeung commented Apr 24, 2024

Hi! I am running the NVIDIA-provided p2p bandwidth test and only achieved a bidirectional bandwidth of 749 GB/s out of the marketed theoretical 900 GB/s, and a unidirectional bandwidth of 380 GB/s out of the theoretical 450 GB/s, on H100 SXM NVLink4. I see that @stas00 was only able to achieve 376 GB/s too (stas results).

749 out of 900 means that even in the best case, this p2p test was only able to achieve about 83% of the marketed theoretical peak bandwidth.

The repro script, full output, and full setup are provided below for convenience. Please let me know if this is expected, or whether I am missing something.

Bidirectional results (~749 GB/s out of 900 GB/s)

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2573.14 742.21 744.11 749.03 740.60 742.12 742.30 741.20 
     1 744.49 2578.12 741.71 742.24 741.40 741.98 743.15 741.97 
     2 740.63 773.25 2569.51 744.03 741.82 741.31 752.89 749.18 
     3 739.42 742.34 772.86 2574.53 742.26 741.05 741.21 741.51 
     4 748.38 742.66 740.10 741.47 2573.67 741.88 742.54 741.77 
     5 748.36 741.19 740.85 741.14 740.29 2578.38 744.75 742.73 
     6 748.86 741.66 741.87 743.75 739.95 741.62 2572.61 741.75 
     7 748.57 741.45 741.02 743.01 741.86 741.53 740.83 2576.72 
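
(In these matrices the diagonal entries are device-to-itself copies, i.e. local HBM3 bandwidth; the off-diagonal entries are the NVLink peer-to-peer transfers in question.)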

Unidirectional results (~380 GB/s out of 450 GB/s)

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2494.76 371.02 376.65 376.60 376.82 375.59 375.94 375.60 
     1 376.53 2528.57 376.11 376.37 376.18 377.11 375.93 376.86 
     2 368.36 393.03 2514.46 378.81 376.32 375.96 376.23 376.12 
     3 381.44 375.28 392.30 2519.65 376.47 376.62 375.99 380.77 
     4 379.53 375.94 375.42 392.11 2510.29 375.40 376.02 375.61 
     5 378.61 376.63 377.58 376.30 376.09 2520.54 376.35 375.54 
     6 379.78 376.04 375.99 376.17 376.50 376.45 2519.53 375.25 
     7 380.27 376.79 375.69 375.63 376.25 376.38 376.12 2519.91 

Full Results Logs

$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 80GB HBM3, pciBusID: 18, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 80GB HBM3, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 80GB HBM3, pciBusID: 3a, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 80GB HBM3, pciBusID: 5d, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 80GB HBM3, pciBusID: 9a, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 80GB HBM3, pciBusID: ab, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 80GB HBM3, pciBusID: ba, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 80GB HBM3, pciBusID: db, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0      1      2      3      4      5      6      7
     0      1      1      1      1      1      1      1      1
     1      1      1      1      1      1      1      1      1
     2      1      1      1      1      1      1      1      1
     3      1      1      1      1      1      1      1      1
     4      1      1      1      1      1      1      1      1
     5      1      1      1      1      1      1      1      1
     6      1      1      1      1      1      1      1      1
     7      1      1      1      1      1      1      1      1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 2474.76  37.19  37.20  36.91  37.06  37.16  37.28  37.24
     1  36.36 2512.44  37.35  36.32  36.95  37.12  36.74  36.30
     2  36.34  36.27 2498.88  36.46  37.51  37.54  36.37  37.23
     3  36.74  37.12  37.23 2499.38  37.00  37.86  36.69  37.02
     4  36.87  37.10  37.36  37.14 2499.63  37.26  37.25  37.47
     5  37.45  37.52  37.02  37.56  37.07 2514.71  38.05  37.26
     6  36.79  36.40  37.49  37.41  37.31  37.10 2504.76  37.38
     7  37.02  37.38  37.24  37.09  37.92  37.46  37.38 2503.38
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 2494.76 371.02 376.65 376.60 376.82 375.59 375.94 375.60
     1 376.53 2528.57 376.11 376.37 376.18 377.11 375.93 376.86
     2 368.36 393.03 2514.46 378.81 376.32 375.96 376.23 376.12
     3 381.44 375.28 392.30 2519.65 376.47 376.62 375.99 380.77
     4 379.53 375.94 375.42 392.11 2510.29 375.40 376.02 375.61
     5 378.61 376.63 377.58 376.30 376.09 2520.54 376.35 375.54
     6 379.78 376.04 375.99 376.17 376.50 376.45 2519.53 375.25
     7 380.27 376.79 375.69 375.63 376.25 376.38 376.12 2519.91
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 2573.47  44.34  44.26  44.08  50.61  51.42  50.92  50.89
     1  45.51 2579.38  45.63  45.12  51.87  52.10  51.28  51.96
     2  43.98  44.81 2576.59  44.28  51.03  51.37  50.54  51.37
     3  43.96  44.77  44.60 2579.18  51.01  50.91  51.17  50.50
     4  50.88  51.46  50.95  50.70 2580.71  51.44  51.62  51.43
     5  51.21  50.97  51.35  50.77  51.18 2577.78  51.51  51.32
     6  50.89  50.99  50.82  50.91  52.17  51.24 2578.98  51.73
     7  50.84  51.48  51.18  51.18  51.38  51.69  51.49 2579.91
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 2573.14 742.21 744.11 749.03 740.60 742.12 742.30 741.20
     1 744.49 2578.12 741.71 742.24 741.40 741.98 743.15 741.97
     2 740.63 773.25 2569.51 744.03 741.82 741.31 752.89 749.18
     3 739.42 742.34 772.86 2574.53 742.26 741.05 741.21 741.51
     4 748.38 742.66 740.10 741.47 2573.67 741.88 742.54 741.77
     5 748.36 741.19 740.85 741.14 740.29 2578.38 744.75 742.73
     6 748.86 741.66 741.87 743.75 739.95 741.62 2572.61 741.75
     7 748.57 741.45 741.02 743.01 741.86 741.53 740.83 2576.72
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.38  13.56  12.75  12.92  13.35  13.31  15.03  19.14
     1  14.85   2.14  13.06  13.67  13.83  14.08  13.77  15.11
     2  13.41  12.83   2.31  13.39  18.86  20.05  21.71  13.84
     3  12.52  13.17  13.22   2.18  13.42  14.29  15.51  14.92
     4  12.89  13.60  13.19  13.11   2.32  12.82  21.12  21.12
     5  12.78  12.91  12.67  12.34  21.11   2.16  12.70  12.69
     6  12.65  14.00  12.62  12.84  21.27  21.35   2.22  12.96
     7  12.75  13.16  13.49  12.82  12.78  12.78  21.37   2.13

   CPU     0      1      2      3      4      5      6      7
     0   2.29   6.90   6.73   6.79   6.23   6.29   6.36   6.20
     1   6.74   2.27   6.77   6.92   6.28   6.48   6.39   6.31
     2   6.76   6.80   2.11   6.89   6.28   6.44   6.34   6.17
     3   6.65   6.79   6.69   2.14   6.40   6.40   6.41   6.22
     4   6.37   6.49   6.36   6.43   2.04   6.57   6.52   6.35
     5   6.97   7.00   6.85   6.97   6.05   2.02   6.10   5.97
     6   6.40   6.52   6.40   6.49   6.02   6.00   2.00   5.94
     7   6.39   6.44   6.34   6.45   6.02   6.02   5.94   1.98
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.36   3.29   2.26   2.77   2.77   2.77   2.25   2.81
     1   2.27   2.11   2.27   2.31   2.26   2.80   2.25   2.83
     2   2.27   2.77   2.33   2.78   2.78   2.75   2.78   2.28
     3   3.31   2.79   2.83   2.19   2.30   2.80   2.80   2.80
     4   2.93   2.40   2.95   2.89   2.34   2.89   2.90   2.93
     5   2.38   2.33   2.33   2.33   2.36   2.14   2.34   2.32
     6   2.94   2.87   2.91   2.34   2.87   2.95   2.24   2.92
     7   2.91   2.33   2.89   2.88   2.88   2.88   2.87   2.12

   CPU     0      1      2      3      4      5      6      7
     0   2.24   1.79   1.77   1.77   1.78   1.78   1.79   1.76
     1   1.86   2.23   1.82   1.82   1.79   1.83   1.81   1.80
     2   1.85   1.81   2.22   1.80   1.80   1.85   1.80   1.78
     3   1.89   1.82   1.81   2.21   1.85   1.79   1.81   1.81
     4   1.69   1.66   1.65   1.66   1.99   1.64   1.69   1.67
     5   1.73   1.75   1.67   1.68   1.69   2.02   1.69   1.69
     6   1.73   1.68   1.70   1.74   1.70   1.72   2.07   1.71
     7   1.79   1.68   1.76   1.68   1.68   1.70   1.69   2.03

Repro

git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
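
For a quick sanity check without building the whole samples tree, the sketch below captures the same measurement idea for a single unidirectional cell of the matrix. It is not the NVIDIA sample itself; the choice of GPUs 0 and 1, the 1 GiB buffer, and the 20 repetitions are illustrative assumptions.

// p2p_sketch.cu -- minimal unidirectional P2P bandwidth sketch (assumes GPUs 0 and 1)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ULL << 30;   // 1 GiB per copy (illustrative)
    const int reps = 20;               // repetitions to average over (illustrative)

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("No P2P between GPU 0 and GPU 1\n"); return 1; }

    // Allocate the destination on GPU 1 and enable peer access both ways.
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    void *dst = nullptr; cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    void *src = nullptr; cudaMalloc(&src, bytes);

    // Warm-up copy so peer mappings are established before timing.
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);  // GPU0 -> GPU1 over NVLink
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (double)bytes * reps / (ms / 1e3) / 1e9);
    return 0;
}

Building with nvcc (e.g. nvcc p2p_sketch.cu -o p2p_sketch) and running it should roughly reproduce one off-diagonal entry of the unidirectional matrix.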

Setup

  • Bare-metal Supermicro 8x H100 SXM Intel server (GPU SuperServer SYS-821GE-TNHR)
  • CUDA version: 12.4
  • Driver version: 550.54.15
  • Fabric Manager: 550.54.15
  • Supermicro BIOS version: 2.1
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
| N/A   28C    P0             88W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   31C    P0             83W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3A:00.0 Off |                    0 |
| N/A   31C    P0             81W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   27C    P0             79W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9A:00.0 Off |                    0 |
| N/A   28C    P0             80W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:AB:00.0 Off |                    0 |
| N/A   30C    P0             78W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:BA:00.0 Off |                    0 |
| N/A   32C    P0             83W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DB:00.0 Off |                    0 |
| N/A   29C    P0             77W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvidia-smi topology (it is NV18 all-to-all between the GPUs within the node)

nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55,112-167	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55,112-167	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	SYS	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55,112-167	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	PIX	SYS	SYS	SYS	SYS	0-55,112-167	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	PIX	SYS	SYS	SYS	56-111,168-223	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	SYS	SYS	56-111,168-223	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	SYS	56-111,168-223	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	56-111,168-223	1		N/A
NIC0	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC1	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC2	SYS	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC3	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	SYS				
NIC4	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	SYS				
NIC5	SYS	SYS	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS				
NIC6	SYS	SYS	SYS	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS				
NIC7	SYS	SYS	SYS	SYS	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS				
NIC8	SYS	SYS	SYS	SYS	SYS	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS				
NIC9	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
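
For context on the NV18 entries: NVLink4 on H100 provides 25 GB/s per direction per link, so a bonded set of 18 links gives

  18 links × 25 GB/s = 450 GB/s per direction
  2 × 450 GB/s       = 900 GB/s bidirectional

which is where the marketed 450/900 GB/s figures come from.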

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
AddyLaddy (Collaborator) commented

This is not an NCCL issue. I suggest you contact your vendor or NVIDIA technical sales representative.

OrenLeung (Author) commented

> This is not an NCCL issue. I suggest you contact your vendor or NVIDIA technical sales representative.

Thanks.

On the bright side, I am now at 783 GB/s by using the max transfer size, but I am still missing ~100 GB/s:

./p2pBandwidthLatencyTest --numElems=10474830000

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2575.02 783.54 783.90 783.87 783.59 783.83 783.73 783.72 
     1 783.58 2662.83 783.63 783.72 783.69 783.73 783.56 783.67 
     2 783.85 783.69 2661.70 783.89 783.81 783.82 783.67 783.78 
     3 783.66 783.68 783.73 2662.10 783.86 783.94 783.80 783.65 
     4 783.80 783.70 783.64 783.83 2662.24 783.83 783.76 783.88 
     5 783.73 783.67 783.87 783.91 783.60 2662.16 783.55 783.84 
     6 783.61 783.69 783.71 783.69 783.69 783.76 2663.57 783.61 
     7 783.87 783.58 783.76 783.67 783.91 783.94 783.66 2662.34 
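
That works out to 783 / 900 ≈ 87% of the marketed peak, up from ~83% at the default transfer size; presumably the larger copies amortize per-transfer launch overhead.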
