Update a3-highgpu-8g blueprint to use latest v5 tag #2572

Merged 1 commit on May 15, 2024

@tpdownes (Member) commented May 10, 2024

Update the a3-highgpu-8g AI/ML blueprint to use the latest v5 release of the Slurm-GCP solution. The change passed manual testing: the cluster was provisioned and sample AI/ML workloads were submitted. Excerpt of the test output:

cpu-bind=MASK - slurm0-a3-ghpc-1, task 13  5 [9817]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-1, task 14  6 [9818]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-1, task 15  7 [9819]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-0, task  1  1 [9970]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-0, task  2  2 [9971]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-0, task  3  3 [9972]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-0, task  4  4 [9973]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-0, task  5  5 [9974]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-0, task  6  6 [9975]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
cpu-bind=MASK - slurm0-a3-ghpc-0, task  7  7 [9976]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffff set
# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 2000 agg iters: 1 validation: 0 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   9969 on slurm0-a3-ghpc-0 device  0 [0x04] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid   9970 on slurm0-a3-ghpc-0 device  1 [0x05] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid   9971 on slurm0-a3-ghpc-0 device  2 [0x0a] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid   9972 on slurm0-a3-ghpc-0 device  3 [0x0b] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid   9973 on slurm0-a3-ghpc-0 device  4 [0x84] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid   9974 on slurm0-a3-ghpc-0 device  5 [0x85] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid   9975 on slurm0-a3-ghpc-0 device  6 [0x8a] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid   9976 on slurm0-a3-ghpc-0 device  7 [0x8b] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid   9812 on slurm0-a3-ghpc-1 device  0 [0x04] NVIDIA H100 80GB HBM3
#  Rank  9 Group  0 Pid   9813 on slurm0-a3-ghpc-1 device  1 [0x05] NVIDIA H100 80GB HBM3
#  Rank 10 Group  0 Pid   9814 on slurm0-a3-ghpc-1 device  2 [0x0a] NVIDIA H100 80GB HBM3
#  Rank 11 Group  0 Pid   9815 on slurm0-a3-ghpc-1 device  3 [0x0b] NVIDIA H100 80GB HBM3
#  Rank 12 Group  0 Pid   9816 on slurm0-a3-ghpc-1 device  4 [0x84] NVIDIA H100 80GB HBM3
#  Rank 13 Group  0 Pid   9817 on slurm0-a3-ghpc-1 device  5 [0x85] NVIDIA H100 80GB HBM3
#  Rank 14 Group  0 Pid   9818 on slurm0-a3-ghpc-1 device  6 [0x8a] NVIDIA H100 80GB HBM3
#  Rank 15 Group  0 Pid   9819 on slurm0-a3-ghpc-1 device  7 [0x8b] NVIDIA H100 80GB HBM3
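
A log like the one above can be sanity-checked mechanically rather than by eye. The following is a minimal sketch, not part of this PR: it assumes the two line shapes seen in the excerpt (the `cpu-bind=MASK` lines and the `#  Rank ...` device lines), and the names `parse_log`, `ranks_per_host`, and `cpus_in_mask` are hypothetical helpers introduced here for illustration. It counts ranks per node and the number of CPUs set in each binding mask.

```python
import re
from collections import Counter

# Device lines, e.g.
# "#  Rank  0 Group  0 Pid   9969 on slurm0-a3-ghpc-0 device  0 [0x04] NVIDIA H100 80GB HBM3"
RANK_RE = re.compile(
    r"#\s+Rank\s+(\d+)\s+Group\s+\d+\s+Pid\s+\d+\s+on\s+(\S+)"
    r"\s+device\s+(\d+)\s+\[(0x[0-9a-fA-F]+)\]\s+(.+)"
)

# CPU binding lines, e.g.
# "cpu-bind=MASK - slurm0-a3-ghpc-0, task  1  1 [9970]: mask 0xff set"
MASK_RE = re.compile(
    r"cpu-bind=MASK - (\S+), task\s+(\d+)\s+(\d+)\s+\[(\d+)\]: mask (0x[0-9a-fA-F]+) set"
)


def cpus_in_mask(mask_hex):
    """Count the CPUs enabled in a hexadecimal affinity mask."""
    return bin(int(mask_hex, 16)).count("1")


def parse_log(log_text):
    """Return (ranks, masks) parsed from the log text.

    ranks: list of (rank, host, device_index, gpu_name)
    masks: list of (host, global_task_id, cpu_count)
    """
    ranks, masks = [], []
    for line in log_text.splitlines():
        m = RANK_RE.search(line)
        if m:
            ranks.append(
                (int(m.group(1)), m.group(2), int(m.group(3)), m.group(5).strip())
            )
            continue
        m = MASK_RE.search(line)
        if m:
            masks.append((m.group(1), int(m.group(2)), cpus_in_mask(m.group(5))))
    return ranks, masks


def ranks_per_host(ranks):
    """Count how many ranks landed on each host."""
    return Counter(host for _, host, _, _ in ranks)


# Shortened sample: the mask is abbreviated to 0xff for illustration only.
SAMPLE = """\
cpu-bind=MASK - slurm0-a3-ghpc-0, task  1  1 [9970]: mask 0xff set
#  Rank  0 Group  0 Pid   9969 on slurm0-a3-ghpc-0 device  0 [0x04] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid   9812 on slurm0-a3-ghpc-1 device  0 [0x04] NVIDIA H100 80GB HBM3
"""

if __name__ == "__main__":
    ranks, masks = parse_log(SAMPLE)
    for host, count in sorted(ranks_per_host(ranks).items()):
        print(f"{host}: {count} rank(s)")
    for host, task, cpus in masks:
        print(f"{host} task {task}: {cpus} CPU(s) in mask")
```

Run against the full log, a healthy two-node run of this shape would be expected to show the same number of ranks on each node and one GPU device index per rank; a lopsided count would point to a scheduling or binding problem.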

Submission Checklist

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cloud HPC Toolkit Contribution guidelines

@tpdownes self-assigned this on May 10, 2024
@tpdownes added the release-version-updates label (Added to release notes under the "Version Updates" heading.) on May 14, 2024
@tpdownes assigned harshthakkar01 and unassigned tpdownes on May 14, 2024
@harshthakkar01 (Contributor) commented

I think this will need to be updated to 5.11.1, right?

@tpdownes (Member, Author) commented

> I think this will need to be updated to 5.11.1, right?

Yes, but I will include that in the general Toolkit update to 5.11.1.

@tpdownes assigned harshthakkar01 and unassigned tpdownes on May 15, 2024
@tpdownes merged commit a4cad58 into GoogleCloudPlatform:develop on May 15, 2024
13 of 54 checks passed
@tpdownes deleted the a3_image_update branch on May 15, 2024 18:10
@harshthakkar01 mentioned this pull request on May 23, 2024
Labels
release-version-updates Added to release notes under the "Version Updates" heading.