When I train MASTER, NaN shows up right at the start of training, which is very strange.
@anbo724 have you solved it?
I ran into the same thing: the loss is inf at the very beginning, and every loss after that is nan.
Prerequisite
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmocr
Environment
crnn_mini-vgg_5e_toy.py
# training schedule for 1x
_base_ = [
    '../_base_/default_runtime.py',
    '../_base_/datasets/ipa_data.py',
    '../_base_/schedules/schedule_adadelta_5e.py',
    '_base_crnn_mini-vgg.py',
]

# dataset settings
train_list = [_base_.toy_rec_train]
test_list = [_base_.toy_rec_test]

default_hooks = dict(logger=dict(type='LoggerHook', interval=50), )

train_dataloader = dict(
    batch_size=256,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='ConcatDataset',
        datasets=train_list,
        pipeline=_base_.train_pipeline))

val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='ConcatDataset',
        datasets=test_list,
        pipeline=_base_.test_pipeline))
test_dataloader = val_dataloader

_base_.model.decoder.dictionary.update(
    dict(with_unknown=True, unknown_token=None))
_base_.train_cfg.update(dict(max_epochs=200, val_interval=10))

val_evaluator = dict(dataset_prefixes=['ipa'])
test_evaluator = val_evaluator
ipa_data.py
toy_data_root = '/home/lcj/mmocr/data/recog/ipa10w/'

toy_rec_train = dict(
    type='OCRDataset',
    data_root=toy_data_root,
    data_prefix=dict(img_path='images/'),
    ann_file='train_labels.json',
    pipeline=None,
    test_mode=False)

toy_rec_test = dict(
    type='OCRDataset',
    data_root=toy_data_root,
    data_prefix=dict(img_path='images/'),
    ann_file='test_labels.json',
    pipeline=None,
    test_mode=True)
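Since this is a custom dataset, one thing that may be worth ruling out is the labels themselves. CTC can only align a label if the model emits at least one time step per character, plus one extra step between each pair of identical neighbouring characters. A label that needs more steps than the model produces gives an infinite loss, which can then turn into nan. The sketch below is not part of MMOCR. It assumes the annotation file follows the MMOCR 1.x recognition layout (a top-level `data_list` whose items hold `img_path` and `instances[].text`), and `num_steps` is a hypothetical value that has to be replaced with the real output sequence length of the model.

```python
from typing import Any, Dict, List, Optional, Tuple


def ctc_min_input_len(text: str) -> int:
    """Smallest number of time steps CTC needs to emit `text`.

    One step per character, plus one blank step between every pair of
    identical neighbouring characters.
    """
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    return len(text) + repeats


def find_suspect_labels(
    data_list: List[Dict[str, Any]], num_steps: int
) -> List[Tuple[int, Optional[str], str]]:
    """Return (index, img_path, reason) for labels that look problematic.

    `num_steps` is an assumption supplied by the caller: the number of
    time steps the recognizer outputs for one image.
    """
    suspects = []
    for idx, item in enumerate(data_list):
        for inst in item.get("instances", []):
            text = inst.get("text", "")
            if not text:
                # Empty labels are usually annotation mistakes.
                suspects.append((idx, item.get("img_path"), "empty label"))
            elif ctc_min_input_len(text) > num_steps:
                suspects.append(
                    (idx, item.get("img_path"), "label too long for CTC"))
    return suspects


# Usage sketch (path and num_steps are placeholders, not taken from the issue):
#   import json
#   with open("train_labels.json", encoding="utf-8") as f:
#       ann = json.load(f)
#   for hit in find_suspect_labels(ann["data_list"], num_steps=26):
#       print(hit)
```

If this reports nothing, the labels are probably not the trigger and the cause has to be looked for elsewhere.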
Reproduces the problem - code sample
CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/textrecog/crnn/crnn_mini-vgg_5e_toy.py --work-dir myipa/
Reproduces the problem - command or script
CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/textrecog/crnn/crnn_mini-vgg_5e_toy.py --work-dir myipa/
Reproduces the problem - error message
10/07 23:13:52 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
10/07 23:13:52 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
10/07 23:13:52 - mmengine - INFO - Checkpoints will be saved to /home/lcj/mmocr/myipa.
10/07 23:14:03 - mmengine - INFO - Epoch(train) [1][ 50/391] lr: 1.0000e+00 eta: 4:41:58 time: 0.1844 data_time: 0.1028 memory: 1426 loss: 3.2401 loss_ctc: 3.2401
10/07 23:14:11 - mmengine - INFO - Epoch(train) [1][100/391] lr: 1.0000e+00 eta: 3:57:24 time: 0.1422 data_time: 0.0513 memory: 1426 loss: 3.1053 loss_ctc: 3.1053
10/07 23:14:18 - mmengine - INFO - Epoch(train) [1][150/391] lr: 1.0000e+00 eta: 3:44:46 time: 0.1379 data_time: 0.0508 memory: 1426 loss: 2.8808 loss_ctc: 2.8808
10/07 23:14:26 - mmengine - INFO - Epoch(train) [1][200/391] lr: 1.0000e+00 eta: 3:37:39 time: 0.1295 data_time: 0.0486 memory: 1426 loss: 2.9587 loss_ctc: 2.9587
10/07 23:14:34 - mmengine - INFO - Epoch(train) [1][250/391] lr: 1.0000e+00 eta: 3:34:55 time: 0.1951 data_time: 0.1082 memory: 1426 loss: 2.7018 loss_ctc: 2.7018
10/07 23:14:41 - mmengine - INFO - Epoch(train) [1][300/391] lr: 1.0000e+00 eta: 3:32:02 time: 0.1376 data_time: 0.0502 memory: 1426 loss: 2.4804 loss_ctc: 2.4804
10/07 23:14:49 - mmengine - INFO - Epoch(train) [1][350/391] lr: 1.0000e+00 eta: 3:30:05 time: 0.1351 data_time: 0.0509 memory: 1426 loss: nan loss_ctc: nan
10/07 23:14:55 - mmengine - INFO - Exp name: crnn_mini-vgg_5e_toy_20231007_231344
10/07 23:14:55 - mmengine - INFO - Saving checkpoint at 1 epochs
10/07 23:15:05 - mmengine - INFO - Epoch(train) [2][ 50/391] lr: 1.0000e+00 eta: 3:31:22 time: 0.1907 data_time: 0.0931 memory: 1426 loss: nan loss_ctc: nan
10/07 23:15:13 - mmengine - INFO - Epoch(train) [2][100/391] lr: 1.0000e+00 eta: 3:29:32 time: 0.1414 data_time: 0.0608 memory: 1426 loss: nan loss_ctc: nan
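In the log above the loss is still finite at iteration 300 of epoch 1 and is nan at iteration 350, so with `interval=50` the first bad step lies somewhere in iterations 301 to 350. For longer logs, a small helper can locate that point automatically. This is a minimal sketch that assumes the line layout shown above; the function name and the regular expression are illustrative, not part of mmengine.

```python
import math
import re
from typing import Iterable, Optional, Tuple

# Matches e.g. "Epoch(train) [1][350/391] ... loss: nan loss_ctc: nan"
# and captures the epoch, the iteration and the first "loss:" value.
LOSS_RE = re.compile(
    r"Epoch\(train\)\s+\[(\d+)\]\[\s*(\d+)/\d+\].*?\bloss:\s*(\S+)")


def first_non_finite_loss(lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (epoch, iteration) of the first logged nan/inf loss, or None."""
    for line in lines:
        match = LOSS_RE.search(line)
        if match is None:
            continue  # warnings, checkpoint messages, etc.
        epoch, iteration, loss = match.groups()
        if not math.isfinite(float(loss)):
            return int(epoch), int(iteration)
    return None
```

Passing the lines of the training log (for example an open file object) returns the first logged position at which the loss is no longer finite, which narrows down which batches to inspect.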
Additional information
No response