
Training produces NaN within the first ten batches #411

Open
powermano opened this issue Aug 12, 2021 · 2 comments

Comments

@powermano

Environment: MXNet 1.8, fp16, ResNet-50; BytePS 0.2.5.post15.
Training produces NaN, as shown below. (Changing the lr from 0.1 to 0.001 makes the NaN disappear, but then the loss does not decrease.)

2021-08-12 05:44:02,294 Epoch[1] Batch[1] Speed: 2.38 samples/sec,  IDLoss=45.729,
2021-08-12 05:44:02,654 Epoch[1] Batch[2] Speed: 606.44 samples/sec,  IDLoss=46.012,
2021-08-12 05:44:03,550 Epoch[1] Batch[3] Speed: 142.99 samples/sec,  IDLoss=45.780,
2021-08-12 05:44:04,426 Epoch[1] Batch[4] Speed: 146.38 samples/sec,  IDLoss=60.220,
2021-08-12 05:44:05,311 Epoch[1] Batch[5] Speed: 229.28 samples/sec,  IDLoss=61.163,
2021-08-12 05:44:06,251 Epoch[1] Batch[6] Speed: 136.34 samples/sec,  IDLoss=70.405,
2021-08-12 05:44:07,141 Epoch[1] Batch[7] Speed: 143.86 samples/sec,  IDLoss=nan,
2021-08-12 05:44:07,971 Epoch[1] Batch[8] Speed: 156.27 samples/sec,  IDLoss=nan,
2021-08-12 05:44:08,882 Epoch[1] Batch[9] Speed: 140.62 samples/sec,  IDLoss=nan,

But when I use fp32, the NaN disappears. Is there a problem with BytePS fp16?
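(For context, not from the issue author's code: NaNs that appear only under fp16 are often gradient under/overflow, and a common mitigation is loss scaling with fp32 master weights. Below is a minimal sketch of static loss scaling in plain MXNet Gluon, without BytePS; the model, hyperparameters, and `train_loader` are placeholders.)

```python
# Minimal sketch: fp16 training with static loss scaling in MXNet Gluon.
# Model, optimizer settings, and `train_loader` are placeholders.
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)
net = vision.resnet50_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.cast('float16')  # fp16 weights and activations

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
loss_scale = 128.0   # scale the loss up so small fp16 gradients do not underflow

trainer = gluon.Trainer(
    net.collect_params(), 'sgd',
    {'learning_rate': 0.1, 'momentum': 0.9,
     'multi_precision': True,            # keep an fp32 master copy of the weights
     'rescale_grad': 1.0 / loss_scale})  # undo the scaling at update time

for data, label in train_loader:  # placeholder DataLoader
    data = data.astype('float16').as_in_context(ctx)
    label = label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label) * loss_scale
    loss.backward()
    trainer.step(data.shape[0])
```

If this runs clean without BytePS but NaNs reappear with it, that points at how BytePS handles the fp16 gradients rather than at the model itself.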

@powermano (Author)

@ymjiang @bobzhuyb

@powermano (Author)

Without BytePS, the same training code works fine.
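(A sketch of the kind of A/B test this comment implies, assumed rather than taken from the issue: swap BytePS's `DistributedTrainer` for a plain `gluon.Trainer` and keep everything else identical, so any remaining NaN must come from the BytePS path.)

```python
# Toggle BytePS on/off to isolate the problem; everything else stays the same.
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.model_zoo import vision

USE_BYTEPS = False  # flip to True to reproduce the NaN report

net = vision.resnet50_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=mx.gpu(0))
opt_params = {'learning_rate': 0.1, 'momentum': 0.9, 'multi_precision': True}

if USE_BYTEPS:
    import byteps.mxnet as bps
    bps.init()  # requires the usual BytePS/DMLC environment variables
    trainer = bps.DistributedTrainer(net.collect_params(), 'sgd', opt_params)
else:
    trainer = gluon.Trainer(net.collect_params(), 'sgd', opt_params)
```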
