Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NVLSTree: Do not reduce chunkSize to 64KiB on large number of nodes #1112

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

liralon
Copy link

@liralon liralon commented Dec 13, 2023

This heuristic was introduced by commit 8c6c595 ("2.19.3-1") with no explanation in commit message or comment in code other than "H800/H100 fixes and tuning".

In practice, we have noticed this line to cause a significant performance reduction when using NVLSTree algorithm when running nccl-tests all-reduce on a cluster of 8 AWS P5 nodes.

nccl-tests 4GB all-reduce output before this change:

                                                                   out-of-place                       in-place
            size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
             (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      4294967296    1073741824     float     sum      -1    81146   52.93  104.20    N/A    80945   53.06  104.46    N/A

output after this change:

                                                                   out-of-place                       in-place
            size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
             (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      4294967296    1073741824     float     sum      -1    30222  142.11  279.78    N/A    30309  141.70  278.98    N/A

output on commit f9c3dc2 ("2.19.1-1") prior to introducing this line:

                                                                   out-of-place                       in-place
            size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
             (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      4294967296    1073741824     float     sum      -1    30196  142.24  280.03    N/A    30131  142.54  280.63    N/A

Fixes: 8c6c595 ("2.19.3-1")

This heuristic was introduced by commit 8c6c595 ("2.19.3-1") with
no explanation in commit message or comment in code other than
"H800/H100 fixes and tuning".

In practice, we have noticed this line to cause a significant
performance reduction when using NVLSTree algorithm when running
nccl-tests all-reduce on a cluster of 8 AWS P5 nodes.

nccl-tests 4GB all-reduce output before this change:
                                                               out-of-place                       in-place
        size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  4294967296    1073741824     float     sum      -1    81146   52.93  104.20    N/A    80945   53.06  104.46    N/A

output after this change:
                                                               out-of-place                       in-place
        size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  4294967296    1073741824     float     sum      -1    30222  142.11  279.78    N/A    30309  141.70  278.98    N/A

output on commit f9c3dc2 ("2.19.1-1") prior to introducing this line:
                                                               out-of-place                       in-place
        size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  4294967296    1073741824     float     sum      -1    30196  142.24  280.03    N/A    30131  142.54  280.63    N/A

Fixes: 8c6c595 ("2.19.3-1")
Signed-off-by: Liran Alon <liran@amazon.com>
wangjiezhe added a commit to wangjiezhe/gentoo-local that referenced this pull request Feb 4, 2024
rajachan added a commit to rajachan/aws-ofi-nccl that referenced this pull request Feb 15, 2024
NCCL v2.19.3 reduced the chunk size used when running NVLS Tree
algorithm on greater than 4 nodes to 64KiB. This drastically impacted
performance on AWS (Ref: NVIDIA/nccl#1112
for some data). NCCL v2.20.3 has made this a tunable. Based on
empirical testing, a max chunk size of 512KiB recovers from the
regression and was also observed to be the default in v2.19.3.
Setting this unconditionally without relying on ncclGetVersion symbol
being available, since the parameter did not exist in versions prior
to v2.20.

Signed-off-by: Raghu Raja <raghunch@amazon.com>
rajachan added a commit to aws/aws-ofi-nccl that referenced this pull request Feb 15, 2024
NCCL v2.19.3 reduced the chunk size used when running NVLS Tree
algorithm on greater than 4 nodes to 64KiB. This drastically impacted
performance on AWS (Ref: NVIDIA/nccl#1112
for some data). NCCL v2.20.3 has made this a tunable. Based on
empirical testing, a max chunk size of 512KiB recovers from the
regression and was also observed to be the default in v2.19.3.
Setting this unconditionally without relying on ncclGetVersion symbol
being available, since the parameter did not exist in versions prior
to v2.20.

Signed-off-by: Raghu Raja <raghunch@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant