{ai,lib}[GCCcore/12.2.0,foss/2022b] PyTorch v2.1.2, NCCL v2.18.3 w/ CUDA 12.0.0 #20520

Open. Flamefire wants to merge 2 commits into base: develop

Flamefire
Contributor

@Flamefire Flamefire commented May 13, 2024

(created using eb --new-pr)
This is meant as an alternative to #20155. It uses a newer NCCL version, because the older one currently included in foss/2022b doesn't seem to work with PyTorch 2.1.2.

@SebastianAchilles SebastianAchilles added this to the 4.x milestone May 14, 2024
@SebastianAchilles
Member

Test report by @SebastianAchilles
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.54.15, Python 3.9.18
See https://gist.github.com/SebastianAchilles/caa73902c24edfc4a9f09a1104e38750 for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/c2693ff5dacd31a35769e1bca1515fc6 for a full test report.

@Flamefire
Contributor Author

> Test report by @SebastianAchilles
> FAILED
> Build succeeded for 1 out of 2 (2 easyconfigs in total)
> skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
> See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

That first one failed with

distributed/_tensor/test_dtensor_ops 1/1 failed! Received signal: SIGSEGV

I see that every now and then in various tests, especially test_jit*. It seems to happen randomly and I'm not sure why.

I'll do a larger repeated run for both PRs over the weekend, so I'll have the results to compare on Tuesday (Monday is a public holiday here).
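
For anyone who wants to check how often such a random crash shows up, a repeated run can be tallied with a small script like the one below. This is only a sketch and not what was used for this PR: the helper names `classify` and `tally_runs` are made up for illustration, and the command to repeat is whatever you pass on the command line. It relies on the fact that Python's `subprocess` reports a child killed by a signal as a negative return code on POSIX systems.

```python
import signal
import subprocess
import sys
from collections import Counter


def classify(returncode):
    """Map a subprocess return code to a short outcome label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_runs(cmd, repeats):
    """Run cmd `repeats` times and count how often each outcome occurs."""
    outcomes = Counter()
    for _ in range(repeats):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        outcomes[classify(result.returncode)] += 1
    return outcomes


if __name__ == "__main__":
    # Hypothetical usage: python tally_runs.py <test command and its arguments>
    if len(sys.argv) > 1:
        repeats = 20  # arbitrary number of repetitions, chosen for illustration
        for label, count in tally_runs(sys.argv[1:], repeats).most_common():
            print(f"{label}: {count}/{repeats}")
```

Running the same tally against builds from both PRs gives counts that can be compared directly, which helps separate a random SIGSEGV from a failure tied to one NCCL version.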
