Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to xxxx #19

Open
AlphaNext opened this issue Oct 27, 2023 · 0 comments

Start command:

imagenetpath=mypath
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345  moby_main.py \
       --cfg configs/moby_swin_tiny.yaml --data-path ${imagenetpath} --batch-size 256

But I get the "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to xxxx" error:

[2023-10-24 17:33:21 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][290/625]  eta 0:05:52 lr 0.002772 time 0.5567 (1.0516)    loss 10.5960 (10.9174)  grad_norm 1.4802 (1.5236)       mem 45716MB
[2023-10-24 17:33:38 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][300/625]  eta 0:05:47 lr 0.002785 time 0.7607 (1.0707)    loss 10.7823 (10.9141)  grad_norm 2.3465 (1.5536)       mem 45716MB
[2023-10-24 17:33:45 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][310/625]  eta 0:05:33 lr 0.002797 time 0.9247 (1.0588)    loss 10.9386 (10.9140)  grad_norm 3.8597 (1.6136)       mem 45716MB
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
[2023-10-24 17:33:53 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][320/625]  eta 0:05:20 lr 0.002810 time 0.5590 (1.0518)    loss 11.4219 (10.9264)  grad_norm 3.9233 (inf)  mem 45716MB
[2023-10-24 17:34:00 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][330/625]  eta 0:05:07 lr 0.002823 time 0.5751 (1.0412)    loss 11.6204 (10.9487)  grad_norm 2.7699 (inf)  mem 45716MB
[2023-10-24 17:34:09 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][340/625]  eta 0:04:55 lr 0.002836 time 0.5561 (1.0365)    loss 11.2880 (10.9609)  grad_norm 2.3273 (inf)  mem 45716MB
[2023-10-24 17:34:16 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][350/625]  eta 0:04:42 lr 0.002849 time 0.5530 (1.0271)    loss 11.0601 (10.9651)  grad_norm 0.9230 (inf)  mem 45716MB
[2023-10-24 17:34:23 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][360/625]  eta 0:04:30 lr 0.002861 time 0.5628 (1.0200)    loss 10.9609 (10.9669)  grad_norm 0.8707 (inf)  mem 45716MB
[2023-10-24 17:34:30 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][370/625]  eta 0:04:17 lr 0.002874 time 0.5648 (1.0094)    loss 10.9728 (10.9655)  grad_norm 1.9388 (inf)  mem 45716MB
[2023-10-24 17:34:36 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][380/625]  eta 0:04:04 lr 0.002887 time 0.5568 (0.9993)    loss 10.8801 (10.9645)  grad_norm 0.6718 (inf)  mem 45716MB
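
For anyone reading the log above, here is a rough sketch of what the message and the "(inf)" column may mean. It is a toy model, not the code of moby_main.py or of the mixed-precision library: DynamicLossScaler and RunningMean are hypothetical names, the starting scale of 2**17 and the halving factor are assumptions chosen so that the reduced scale matches the 65536.0 in the log, and the single inf sample stands in for an overflowed step that the log does not print. It also assumes the value in parentheses is a running mean of grad_norm.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical, for illustration only)."""

    def __init__(self, init_scale=2.0 ** 17, backoff=0.5):
        self.scale = init_scale  # assumed starting scale for this demo
        self.backoff = backoff   # assumed: scale is halved on overflow

    def step(self, grad_norm):
        """Return True if the optimizer step would be applied."""
        if not math.isfinite(grad_norm):
            # Overflow: drop this update and shrink the scale.
            self.scale *= self.backoff
            print(f"overflow: skipping step, reducing loss scale to {self.scale}")
            return False
        return True


class RunningMean:
    """Running mean over every value seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    @property
    def avg(self):
        return self.total / self.count


scaler = DynamicLossScaler()
meter = RunningMean()
applied = []

# Finite values are taken from the log; the inf entry is an assumed
# overflowed step that falls between the logged iterations.
for norm in [1.4802, 2.3465, 3.8597, float("inf"), 3.9233, 2.7699]:
    applied.append(scaler.step(norm))
    meter.update(norm)

print(applied)    # only the overflowed step is skipped
print(meter.avg)  # stays inf once a single inf has been added
```

Under these assumptions, one non-finite sample is enough to keep the running mean at inf for the rest of the epoch, even though the per-step grad_norm values after the skipped step are finite again.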
