[FSDP2] Fixed 2D clip grad norm test #126497

awgu · 2024-05-17T01:13:16Z

Stack from ghstack (oldest at bottom):

-> [FSDP2] Fixed 2D clip grad norm test #126497

This fixes #126484.

We switch the test model from a transformer to an MLP stack, since the transformer seems to introduce slight numeric differences when using TP. We include a sequence-parallel layer norm module in the MLP stack to exercise the `(S(0), R)` placement.
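
For context, here is a minimal sketch of the invariant that a sharded clip-grad-norm test checks. It is plain Python without `torch` and is not the test in this PR. The helper names (`local_sq_norm`, `sharded_total_norm`, `clip_shards_`) are hypothetical. The clipping rule (scale by `max_norm / (total_norm + eps)`, capped at 1) is assumed to mirror `torch.nn.utils.clip_grad_norm_`.

```python
import math


def local_sq_norm(shard):
    """Squared L2 norm of one rank's gradient shard."""
    return sum(g * g for g in shard)


def sharded_total_norm(shards):
    """Global L2 norm: sum the per-shard squared norms, then take the square root.

    In a real distributed run the sum would be an all-reduce across ranks.
    """
    return math.sqrt(sum(local_sq_norm(s) for s in shards))


def clip_shards_(shards, max_norm, eps=1e-6):
    """Scale every shard in place by the same coefficient; return the pre-clip norm."""
    total_norm = sharded_total_norm(shards)
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    for shard in shards:
        for i in range(len(shard)):
            shard[i] *= clip_coef
    return total_norm


# Hypothetical gradient, split row-wise across two "ranks" (a Shard(0)-style split).
full_grad = [3.0, 4.0, 12.0, 0.0]
shards = [full_grad[:2], full_grad[2:]]

# The sharded norm should match the norm of the unsharded gradient.
reference_norm = math.sqrt(sum(g * g for g in full_grad))
total_norm = clip_shards_(shards, max_norm=1.0)
```

A parity test of this kind compares the sharded result against an unsharded reference. Small numeric differences in the model's forward or backward pass can therefore fail it even when the clipping logic is correct.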

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

[ghstack-poisoned]

pytorch-bot · 2024-05-17T01:13:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126497

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit e09cfec with merge base 4e6673e:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_smoketest_perf, 1, 1, linux.24xl.spr-metal) (gh) (similar failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 7c452970d09805f883a6f292efce651bb81bb166 Pull Request resolved: #126497

ghstack-source-id: 92373ca426a449391ca19b703c60dda0b6de2352 Pull Request resolved: #126497

ghstack-source-id: b4e373a5ccc4772ef719beae550b9977d762c4dd Pull Request resolved: #126497

weifengpy · 2024-05-17T02:49:16Z

Switching to the FSDP2 release note label really takes some manual work.

wz337

LGTM! Thanks for looking into it so quickly.

awgu · 2024-05-17T11:14:39Z

Failures are all inductor-related, not FSDP2-related.

awgu · 2024-05-17T11:14:42Z

@pytorchbot merge -i

pytorchmergebot · 2024-05-17T11:16:37Z

Merge started

Your change will be merged while ignoring the following 9 checks:

inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu, unstable)
inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu, unstable)
inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable)
inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable)
inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable)
inductor / rocm6.1-py3.8-inductor / test (inductor, 1, 1, linux.rocm.gpu.2)
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-05-17T20:29:13Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2024-05-17T20:29:24Z

@awgu your PR has been successfully reverted.

This reverts commit 3f28906. Reverted #126497 on behalf of https://github.com/jeanschmidt, who reverted it to check whether it might have introduced inductor CUDA 12 issues (comment on #126497).

This fixes pytorch#126484. We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement. Pull Request resolved: pytorch#126497 Approved by: https://github.com/weifengpy, https://github.com/wz337

awgu · 2024-05-20T23:15:01Z

@pytorchbot rebase -s

pytorchmergebot · 2024-05-20T23:16:45Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-05-20T23:17:03Z

Successfully rebased gh/awgu/588/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/126497)

ghstack-source-id: d6fbfab126e41561805c4c8cd4057769b1381d54 Pull Request resolved: #126497

ghstack-source-id: dbf18ac97ab7a5eddef93664404af300827c29e7 Pull Request resolved: #126497

awgu · 2024-05-21T19:00:03Z

.ci/pytorch/test.sh

@@ -326,6 +326,7 @@ test_inductor_distributed() {
  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose
  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
+  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose


The test file name was incorrect before; it is now fixed to include the trailing underscore (`test_fully_shard_clip_grad_norm_.py`).

awgu · 2024-05-21T19:00:15Z

@pytorchbot merge

pytorchmergebot · 2024-05-21T19:02:36Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-05-21T19:18:25Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-focal-py3.8-clang10 / build

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

awgu · 2024-05-21T19:25:09Z

@pytorchbot rebase -s

pytorchmergebot · 2024-05-21T19:26:41Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-05-21T19:26:53Z

Tried to rebase and push PR #126497, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

awgu · 2024-05-21T23:33:48Z

@pytorchbot merge

pytorchmergebot · 2024-05-21T23:35:52Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

[FSDP2] Fixed 2D clip grad norm test

3ef65a4

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels May 17, 2024

awgu added a commit that referenced this pull request May 17, 2024

[FSDP2] Fixed 2D clip grad norm test

4a061ad

ghstack-source-id: 7c452970d09805f883a6f292efce651bb81bb166 Pull Request resolved: #126497

Update on "[FSDP2] Fixed 2D clip grad norm test"

af2619c

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]

awgu added a commit that referenced this pull request May 17, 2024

[FSDP2] Fixed 2D clip grad norm test

0c133fb

ghstack-source-id: 92373ca426a449391ca19b703c60dda0b6de2352 Pull Request resolved: #126497

awgu added the release notes: distributed (fsdp2) label and removed the release notes: distributed (fsdp) label May 17, 2024

awgu requested review from wanchaol, wz337 and weifengpy May 17, 2024 01:16

awgu marked this pull request as ready for review May 17, 2024 01:16

awgu requested a review from a team as a code owner May 17, 2024 01:16

awgu added the ciflow/inductor label May 17, 2024

weifengpy approved these changes May 17, 2024

awgu added a commit that referenced this pull request May 17, 2024

[FSDP2] Fixed 2D clip grad norm test

1663a38

ghstack-source-id: b4e373a5ccc4772ef719beae550b9977d762c4dd Pull Request resolved: #126497

weifengpy approved these changes May 17, 2024

wz337 approved these changes May 17, 2024

pytorch-bot bot added the ciflow/trunk label May 17, 2024

pytorchmergebot added the merging label May 17, 2024

pytorchmergebot added the Merged label May 17, 2024

pytorchmergebot closed this in 3f28906 May 17, 2024

pytorchmergebot removed the merging label May 17, 2024

pytorchmergebot added the Reverted label May 17, 2024

pytorchmergebot reopened this May 17, 2024

Update

8a1cce1

[ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request May 20, 2024

[FSDP2] Fixed 2D clip grad norm test

1468a91

ghstack-source-id: d6fbfab126e41561805c4c8cd4057769b1381d54 Pull Request resolved: #126497

awgu added a commit that referenced this pull request May 21, 2024

[FSDP2] Fixed 2D clip grad norm test

381b892

ghstack-source-id: dbf18ac97ab7a5eddef93664404af300827c29e7 Pull Request resolved: #126497

awgu commented May 21, 2024

pytorchmergebot added the merging label May 21, 2024

pytorchmergebot removed the merging label May 21, 2024

pytorchmergebot pushed a commit that referenced this pull request May 21, 2024

[FSDP2] Fixed 2D clip grad norm test

4ee89c4

ghstack-source-id: dbf18ac97ab7a5eddef93664404af300827c29e7 Pull Request resolved: #126497

pytorchmergebot added the merging label May 21, 2024

pytorchmergebot closed this in 636e799 May 22, 2024

pytorchmergebot removed the merging label May 22, 2024
