
[FSDP2][2D] test_clip_grad_norm_2d is failing on main #126484

Closed
wz337 opened this issue May 17, 2024 · 0 comments
wz337 commented May 17, 2024

🐛 Describe the bug

repro:

python test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d

For norm_type 1, 2, and 3, we observe a numeric discrepancy between ref_total_norm and total_norm.
Some prints:

0, rank: 0, norm_type=2, ref_total_norm=tensor(1200.0919, device='cuda:0'), total_norm.full_tensor()=tensor(1200.5303, device='cuda:0')
0, rank: 0, norm_type=1, ref_total_norm=tensor(48862.6328, device='cuda:0'), total_norm.full_tensor()=tensor(48963.7656, device='cuda:0')
0, rank: 0, norm_type=3, ref_total_norm=tensor(463.1410, device='cuda:0'), total_norm.full_tensor()=tensor(463.1594, device='cuda:0')

We need to investigate the numerics to confirm whether this is a bug.
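
For context, the comparison boils down to a reference total norm computed from the unsharded gradients (the p-norm of the per-parameter gradient p-norms, which is how `clip_grad_norm_` aggregates the total norm it returns) versus the DTensor total norm returned by `clip_grad_norm_` on the 2D-sharded model. Below is a minimal sketch of that comparison, not the test itself: `ref_model`, `model`, and `max_norm` are placeholders, and it assumes both models already hold gradients from the same backward pass.

```python
import torch


def compare_total_norms(ref_model: torch.nn.Module,
                        model: torch.nn.Module,
                        max_norm: float,
                        norm_type: float) -> None:
    """Compare the reference total gradient norm (plain tensors) against the
    DTensor total norm returned by clip_grad_norm_ on the 2D-sharded model."""
    # Reference: p-norm of the per-parameter gradient p-norms. This matches
    # how clip_grad_norm_ aggregates the total norm it returns.
    per_param_norms = [
        torch.linalg.vector_norm(p.grad, norm_type)
        for p in ref_model.parameters()
        if p.grad is not None
    ]
    ref_total_norm = torch.linalg.vector_norm(
        torch.stack(per_param_norms), norm_type
    )

    # Sharded model: with DTensor parameters, clip_grad_norm_ returns the
    # total norm as a DTensor; materialize it with full_tensor() to compare.
    total_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm, norm_type=norm_type
    )
    print(f"norm_type={norm_type}, ref_total_norm={ref_total_norm}, "
          f"total_norm.full_tensor()={total_norm.full_tensor()}")
```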

cc @awgu

Versions

N/A

wz337 added the release notes: distributed (fsdp2) label on May 17, 2024
awgu added a commit that referenced this issue May 17, 2024
This fixes #126484.

We change from a transformer to an MLP stack, since the transformer seems to introduce slight numeric differences when using TP. We include a sequence-parallel layer norm module in the MLP stack to exercise the `(S(0), R)` placement.

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
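
For readers unfamiliar with the `(S(0), R)` placement mentioned above: under `SequenceParallel`, the layer norm's parameters stay replicated across the TP mesh dimension, and FSDP2 then shards them on dim 0 across the data-parallel mesh dimension, yielding `(Shard(0), Replicate)` on the 2D mesh. Below is a rough sketch of wiring a small MLP block with a sequence-parallel `LayerNorm` under TP + FSDP2; `MLPBlock`, the mesh sizes, and the parallelize plan are illustrative (not the test's actual model or plan), and it assumes a launch via `torchrun` with 4 GPUs.

```python
import os

import torch
from torch.distributed._composable.fsdp import fully_shard  # FSDP2
from torch.distributed._tensor import Shard  # torch.distributed.tensor in newer releases
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)


class MLPBlock(torch.nn.Module):
    """Illustrative MLP block ending in a LayerNorm that runs sequence-parallel."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = torch.nn.Linear(dim, 4 * dim)
        self.out_proj = torch.nn.Linear(4 * dim, dim)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(self.out_proj(torch.relu(self.in_proj(x))))


# One process per GPU (e.g. torchrun --nproc-per-node=4); pick this rank's device.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Illustrative 2D mesh: outer dim for FSDP2 (data parallel), inner dim for TP.
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
model = MLPBlock(dim=16).cuda()

# Tensor parallelism on the "tp" mesh dim: shard the linears column-/row-wise,
# emit the row-wise output sharded on the sequence dim (dim 1) so it feeds the
# sequence-parallel LayerNorm, whose parameters stay replicated across TP.
parallelize_module(
    model,
    mesh["tp"],
    {
        "in_proj": ColwiseParallel(),
        "out_proj": RowwiseParallel(output_layouts=Shard(1)),
        "norm": SequenceParallel(),
    },
)

# FSDP2 on the "dp" mesh dim shards every parameter on dim 0, so the LayerNorm
# weight/bias and their gradients get the (Shard(0), Replicate) 2D placement.
fully_shard(model, mesh=mesh["dp"])
```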
pytorchmergebot pushed a commit that referenced this issue May 22, 2024
This fixes #126484.

We change from a transformer to an MLP stack, since the transformer seems to introduce slight numeric differences when using TP. We include a sequence-parallel layer norm module in the MLP stack to exercise the `(S(0), R)` placement.

Pull Request resolved: #126497
Approved by: https://github.com/weifengpy, https://github.com/wz337