
Training a SwinTransformer model: loss does not decrease #3098

Open
lhscau opened this issue Mar 1, 2024 · 9 comments

lhscau commented Mar 1, 2024
Using the code from https://github.com/PaddlePaddle/PaddleClas (develop branch) to train the SwinTransformer model, training does not converge and the loss curve keeps rising.
Hardware: 8x A100 80G and 4x A10
CUDA version: 12.0 and 11.7
Paddle version: paddlepaddle-gpu 2.6.0.post117
OS: Ubuntu 18.04
Training command:

python -m paddle.distributed.launch \
    --ips="127.0.0.1" \
    --devices="0,1,2,3,4,5,6,7" \
    tools/train.py \
    -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml

Symptom: after training starts, the loss begins around 2.6 and keeps trending upward.

Troubleshooting so far: in the training config I added pretrain_mode and tried both setting it to null and pointing it at a pretrained model; the behavior is the same either way.

Please look into this issue so that training runs correctly and the loss decreases.

lhscau (Author) commented Mar 1, 2024

Partial training log:

[2024/03/01 09:07:55] ppcls INFO: [Train][Epoch 65/100][Iter: 0/5005]lr(LinearWarmup): 0.00022245, CELoss: 3.61639, loss: 3.61639, batch_cost: 0.53438s, reader_cost: 0.00619, ips: 119.76526 samples/s, eta: 1 day, 2:44:44
[2024/03/01 09:08:01] ppcls INFO: [Train][Epoch 65/100][Iter: 10/5005]lr(LinearWarmup): 0.00022244, CELoss: 3.33753, loss: 3.33753, batch_cost: 0.53119s, reader_cost: 0.00340, ips: 120.48394 samples/s, eta: 1 day, 2:35:04
[2024/03/01 09:08:06] ppcls INFO: [Train][Epoch 65/100][Iter: 20/5005]lr(LinearWarmup): 0.00022242, CELoss: 3.29766, loss: 3.29766, batch_cost: 0.53110s, reader_cost: 0.00289, ips: 120.50379 samples/s, eta: 1 day, 2:34:43
[2024/03/01 09:08:11] ppcls INFO: [Train][Epoch 65/100][Iter: 30/5005]lr(LinearWarmup): 0.00022240, CELoss: 3.41847, loss: 3.41847, batch_cost: 0.53157s, reader_cost: 0.00306, ips: 120.39770 samples/s, eta: 1 day, 2:36:02
[2024/03/01 09:08:17] ppcls INFO: [Train][Epoch 65/100][Iter: 40/5005]lr(LinearWarmup): 0.00022238, CELoss: 3.39770, loss: 3.39770, batch_cost: 0.53132s, reader_cost: 0.00313, ips: 120.45380 samples/s, eta: 1 day, 2:35:12
[2024/03/01 09:08:22] ppcls INFO: [Train][Epoch 65/100][Iter: 50/5005]lr(LinearWarmup): 0.00022236, CELoss: 3.39246, loss: 3.39246, batch_cost: 0.53125s, reader_cost: 0.00311, ips: 120.46964 samples/s, eta: 1 day, 2:34:54
[2024/03/01 09:08:27] ppcls INFO: [Train][Epoch 65/100][Iter: 60/5005]lr(LinearWarmup): 0.00022234, CELoss: 3.41512, loss: 3.41512, batch_cost: 0.53115s, reader_cost: 0.00309, ips: 120.49334 samples/s, eta: 1 day, 2:34:30
[2024/03/01 09:08:33] ppcls INFO: [Train][Epoch 65/100][Iter: 70/5005]lr(LinearWarmup): 0.00022232, CELoss: 3.43328, loss: 3.43328, batch_cost: 0.53103s, reader_cost: 0.00307, ips: 120.52054 samples/s, eta: 1 day, 2:34:03
[2024/03/01 09:08:38] ppcls INFO: [Train][Epoch 65/100][Iter: 80/5005]lr(LinearWarmup): 0.00022231, CELoss: 3.44480, loss: 3.44480, batch_cost: 0.53114s, reader_cost: 0.00302, ips: 120.49510 samples/s, eta: 1 day, 2:34:18
[2024/03/01 09:08:43] ppcls INFO: [Train][Epoch 65/100][Iter: 90/5005]lr(LinearWarmup): 0.00022229, CELoss: 3.46289, loss: 3.46289, batch_cost: 0.53121s, reader_cost: 0.00298, ips: 120.47976 samples/s, eta: 1 day, 2:34:25
[2024/03/01 09:08:49] ppcls INFO: [Train][Epoch 65/100][Iter: 100/5005]lr(LinearWarmup): 0.00022227, CELoss: 3.47811, loss: 3.47811, batch_cost: 0.53120s, reader_cost: 0.00294, ips: 120.48238 samples/s, eta: 1 day, 2:34:18
[2024/03/01 09:08:54] ppcls INFO: [Train][Epoch 65/100][Iter: 110/5005]lr(LinearWarmup): 0.00022225, CELoss: 3.49017, loss: 3.49017, batch_cost: 0.53117s, reader_cost: 0.00292, ips: 120.48843 samples/s, eta: 1 day, 2:34:08
[2024/03/01 09:08:59] ppcls INFO: [Train][Epoch 65/100][Iter: 120/5005]lr(LinearWarmup): 0.00022223, CELoss: 3.48991, loss: 3.48991, batch_cost: 0.53115s, reader_cost: 0.00290, ips: 120.49402 samples/s, eta: 1 day, 2:33:58
[2024/03/01 09:09:05] ppcls INFO: [Train][Epoch 65/100][Iter: 130/5005]lr(LinearWarmup): 0.00022221, CELoss: 3.45353, loss: 3.45353, batch_cost: 0.53109s, reader_cost: 0.00288, ips: 120.50773 samples/s, eta: 1 day, 2:33:42
[2024/03/01 09:09:10] ppcls INFO: [Train][Epoch 65/100][Iter: 140/5005]lr(LinearWarmup): 0.00022219, CELoss: 3.46932, loss: 3.46932, batch_cost: 0.53105s, reader_cost: 0.00286, ips: 120.51606 samples/s, eta: 1 day, 2:33:30
[2024/03/01 09:09:15] ppcls INFO: [Train][Epoch 65/100][Iter: 150/5005]lr(LinearWarmup): 0.00022217, CELoss: 3.46338, loss: 3.46338, batch_cost: 0.53105s, reader_cost: 0.00285, ips: 120.51606 samples/s, eta: 1 day, 2:33:24
[2024/03/01 09:09:21] ppcls INFO: [Train][Epoch 65/100][Iter: 160/5005]lr(LinearWarmup): 0.00022216, CELoss: 3.44023, loss: 3.44023, batch_cost: 0.53117s, reader_cost: 0.00279, ips: 120.48829 samples/s, eta: 1 day, 2:33:41
[2024/03/01 09:09:26] ppcls INFO: [Train][Epoch 65/100][Iter: 170/5005]lr(LinearWarmup): 0.00022214, CELoss: 3.44467, loss: 3.44467, batch_cost: 0.53111s, reader_cost: 0.00265, ips: 120.50234 samples/s, eta: 1 day, 2:33:25
[2024/03/01 09:09:31] ppcls INFO: [Train][Epoch 65/100][Iter: 180/5005]lr(LinearWarmup): 0.00022212, CELoss: 3.44845, loss: 3.44845, batch_cost: 0.53107s, reader_cost: 0.00251, ips: 120.51049 samples/s, eta: 1 day, 2:33:13
[2024/03/01 09:09:36] ppcls INFO: [Train][Epoch 65/100][Iter: 190/5005]lr(LinearWarmup): 0.00022210, CELoss: 3.45565, loss: 3.45565, batch_cost: 0.53102s, reader_cost: 0.00239, ips: 120.52378 samples/s, eta: 1 day, 2:32:57
[2024/03/01 09:09:42] ppcls INFO: [Train][Epoch 65/100][Iter: 200/5005]lr(LinearWarmup): 0.00022208, CELoss: 3.45171, loss: 3.45171, batch_cost: 0.53098s, reader_cost: 0.00227, ips: 120.53262 samples/s, eta: 1 day, 2:32:45

lhscau (Author) commented Mar 1, 2024

training_script_args: ['-c', 'ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml', '-o', 'Global.device=gpu', '-o', 'Global.use_dali=False', '-o', 'Global.epochs=100', '-o', 'Global.save_interval=10', '-o', 'Global.use_visualdl=True', '-o', 'DataLoader.Train.sampler.batch_size=64', '-o', 'Global.output_dir=./output/SwinTransformer_O1_mlu_16chips']

changdazhou (Contributor)

Which dataset are you using for training?

lhscau (Author) commented Mar 6, 2024

Dataset: the ILSVRC2012 (ImageNet) training set, imagenet_train.

changdazhou (Contributor)

I see you halved the batch_size; the learning_rate needs to be halved accordingly.
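For reference, a minimal sketch of the linear scaling rule being suggested here. The config key and default values below are assumptions, not taken from this thread: it assumes the config exposes the learning rate at Optimizer.lr.learning_rate and pairs its default value (illustratively 5e-4) with a per-card batch_size of 128, so halving the batch size to 64 would halve the learning rate to 2.5e-4. Check the YAML for the actual defaults before reusing the numbers.

    # Sketch: scale the learning rate with the batch size (linear scaling rule).
    # Assumed defaults: lr 5e-4 at per-card batch_size 128 -> lr 2.5e-4 at batch_size 64.
    python -m paddle.distributed.launch --devices="0,1,2,3,4,5,6,7" tools/train.py \
        -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml \
        -o DataLoader.Train.sampler.batch_size=64 \
        -o Optimizer.lr.learning_rate=2.5e-4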

lhscau (Author) commented Mar 11, 2024

Could you provide the training log, test script, and loss-curve reference for this model?

changdazhou (Contributor)

I suggest following the official example. Alternatively, paste your configuration here and I will take a look.

lhscau (Author) commented Mar 13, 2024

python -m paddle.distributed.launch --ips="127.0.0.1" --devices="0,1,2,3,4,5,6,7," tools/train.py \
    -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml \
    -o Global.use_dali=False \
    -o Global.use_visualdl=True \
    -o Global.output_dir=./output/SwinTransformer_O1_mlu_16chips \

changdazhou (Contributor)

python -m paddle.distributed.launch --ips="127.0.0.1" --devices="0,1,2,3,4,5,6,7," tools/train.py -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml -o Global.use_dali=False -o Global.use_visualdl=True -o Global.output_dir=./output/SwinTransformer_O1_mlu_16chips \

So you have AMP O1 enabled, right?
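As a quick way to answer this, one can check whether the config file itself contains an AMP section (the exact field names inside that section vary between PaddleClas versions, so treat them as assumptions):

    # Print any AMP-related settings in the Swin config to confirm whether O1 mixed precision is enabled.
    grep -n -A 5 "AMP" ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml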
