Hi! I'm running a simple test of all-reduce latency, comparing 2 nodes with 1 GPU each (world size 2) against 2 nodes with 2 GPUs each (world size 4). Each machine has 8xH100s, and I'm just using SLURM to allocate 1 or 2 of them on each node. When allocating multiple GPUs on a node, I make sure that each process can see all the local GPUs (to try to avoid #1066). I'm running the following code snippet in a SLURM environment, all-reducing a 1GB tensor:
import time
import torch
import torch.distributed as dist

num_floats = 250000000  # 250M float32 elements ~= 1 GB
data = torch.rand(num_floats, dtype=torch.float32).to("cuda")

# warm-up iterations
for _ in range(5):
    dist.all_reduce(data, async_op=False)
torch.cuda.synchronize()

# timed iteration
start = time.time()
dist.all_reduce(data, async_op=False)
torch.cuda.synchronize()
end = time.time()
print(f"all-reduce took {end - start} seconds")
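For completeness, here's a rough sketch of the process-group setup the snippet assumes; the exact launcher and environment variables (RANK, WORLD_SIZE, LOCAL_RANK, etc.) are my assumptions, not part of the original script:

import os
import torch
import torch.distributed as dist

# Assumed launcher behavior: srun/torchrun exports RANK, WORLD_SIZE,
# LOCAL_RANK, MASTER_ADDR and MASTER_PORT for every process.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL backend, rendezvous through the environment variables above.
dist.init_process_group(backend="nccl", init_method="env://")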
When I run this test on 2 nodes with 1 GPU each (world size 2), it takes ~0.27 seconds. However, when I run this test on 2 nodes with 2 GPUs each (world size 4), it takes ~0.13 seconds. I was surprised by this result, since I expected latency to be higher with more workers. If I run on 2 nodes with 4 GPUs each (world size 8), it's further halved to ~0.067 seconds.
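For reference, here's a back-of-the-envelope conversion of those timings into algorithm and bus bandwidth, using the usual nccl-tests convention for all-reduce (busBw = algBw * 2*(n-1)/n); this is just the measurements above plugged into that formula:

# Rough bandwidth estimate from the reported timings, using the
# nccl-tests convention for all-reduce:
#   algBw = size / time, busBw = algBw * 2 * (n - 1) / n
size_gb = 1.0  # 250M float32 elements ~= 1 GB

for world_size, seconds in [(2, 0.27), (4, 0.13), (8, 0.067)]:
    alg_bw = size_gb / seconds
    bus_bw = alg_bw * 2 * (world_size - 1) / world_size
    print(f"n={world_size}: algBw ~{alg_bw:.1f} GB/s, busBw ~{bus_bw:.1f} GB/s")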
Would anyone be able to explain what's going on here?
Further info about the system / logs:
Here's the output of nvidia-smi topo -m:
Here are the NCCL logs from the 2x1 run: https://gist.github.com/nelson-liu/12271a4076e9572abe4cac83c8a289b3
Here are the NCCL logs from the 2x2 run: https://gist.github.com/nelson-liu/fedaa902e807d131242a2374167cb103
(let me know if there's a cleaner way of getting per-worker logs)
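On the per-worker logs question: one option (not from the original post, so treat it as a suggestion) is NCCL's NCCL_DEBUG_FILE setting, which substitutes %h with the hostname and %p with the process PID, e.g. set before init_process_group:

import os

# Assumption: set before dist.init_process_group(); with %h/%p substitution
# each rank writes its NCCL debug output to its own file.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_FILE"] = "nccl.%h.%p.log"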
The ethernet interfaces eth0 and eth1 are 100 Gbps, and rdma[0-15] are each 200 Gbps.
This code is running with PyTorch 2.2.2, NCCL version 2.19.3.