CRNN toy (own data) loss: nan loss_ctc: nan #1996

Open
2 tasks done
anbo724 opened this issue Oct 7, 2023 · 3 comments

anbo724 commented Oct 7, 2023

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmocr

Environment

crnn_mini-vgg_5e_toy.py

# training schedule for 1x
_base_ = [
    '../_base_/default_runtime.py',
    '../_base_/datasets/ipa_data.py',
    '../_base_/schedules/schedule_adadelta_5e.py',
    '_base_crnn_mini-vgg.py',
]

# dataset settings
train_list = [_base_.toy_rec_train]
test_list = [_base_.toy_rec_test]

default_hooks = dict(logger=dict(type='LoggerHook', interval=50), )

train_dataloader = dict(
    batch_size=256,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='ConcatDataset',
        datasets=train_list,
        pipeline=_base_.train_pipeline))
val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='ConcatDataset',
        datasets=test_list,
        pipeline=_base_.test_pipeline))
test_dataloader = val_dataloader

_base_.model.decoder.dictionary.update(
    dict(with_unknown=True, unknown_token=None))
_base_.train_cfg.update(dict(max_epochs=200, val_interval=10))

val_evaluator = dict(dataset_prefixes=['ipa'])
test_evaluator = val_evaluator

ipa_data.py

toy_data_root = '/home/lcj/mmocr/data/recog/ipa10w/'

toy_rec_train = dict(
    type='OCRDataset',
    data_root=toy_data_root,
    data_prefix=dict(img_path='images/'),
    ann_file='train_labels.json',
    pipeline=None,
    test_mode=False)

toy_rec_test = dict(
    type='OCRDataset',
    data_root=toy_data_root,
    data_prefix=dict(img_path='images/'),
    ann_file='test_labels.json',
    pipeline=None,
    test_mode=True)
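
Since the labels come from a custom dataset and the dictionary is configured with `with_unknown=True, unknown_token=None`, it may be worth checking how many label characters fall outside the dictionary. As far as I understand that setting, such characters are skipped when a label is converted to indices, so a label made up only of unknown characters could end up as an empty target; please verify this against the installed MMOCR version. The sketch below is only an illustration: the function name, the sample annotation and the character set are hypothetical, and the `data_list` / `instances` / `text` layout is assumed to match the MMOCR 1.x recognition annotation format.

```python
from collections import Counter


def find_unknown_chars(annotation, dictionary_chars):
    """Count label characters missing from the dictionary.

    `annotation` is ASSUMED to look like
    {"data_list": [{"img_path": ..., "instances": [{"text": ...}]}]}.
    Returns (Counter of missing characters, image paths whose label
    would be empty once the missing characters are dropped).
    """
    known = set(dictionary_chars)
    missing = Counter()
    empty_after_filter = []
    for item in annotation["data_list"]:
        for inst in item["instances"]:
            text = inst["text"]
            missing.update(ch for ch in text if ch not in known)
            if not any(ch in known for ch in text):
                # Every character would be dropped, leaving an empty target.
                empty_after_filter.append(item["img_path"])
    return missing, empty_after_filter


# Hypothetical in-memory sample. In practice, load train_labels.json with the
# json module and pass in the characters from the dictionary file you use.
sample = {
    "data_list": [
        {"img_path": "a.jpg", "instances": [{"text": "abc"}]},
        {"img_path": "b.jpg", "instances": [{"text": "ʃə"}]},
    ]
}
missing, empty = find_unknown_chars(sample, "abcdefghijklmnopqrstuvwxyz0123456789")
print(dict(missing), empty)
```

If the counter is large, or the list of emptied labels is not empty, the dictionary probably does not cover the dataset's character set.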

Reproduces the problem - code sample

CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/textrecog/crnn/crnn_mini-vgg_5e_toy.py --work-dir myipa/

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/textrecog/crnn/crnn_mini-vgg_5e_toy.py --work-dir myipa/

Reproduces the problem - error message

10/07 23:13:52 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
10/07 23:13:52 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
10/07 23:13:52 - mmengine - INFO - Checkpoints will be saved to /home/lcj/mmocr/myipa.
10/07 23:14:03 - mmengine - INFO - Epoch(train) [1][ 50/391] lr: 1.0000e+00 eta: 4:41:58 time: 0.1844 data_time: 0.1028 memory: 1426 loss: 3.2401 loss_ctc: 3.2401
10/07 23:14:11 - mmengine - INFO - Epoch(train) [1][100/391] lr: 1.0000e+00 eta: 3:57:24 time: 0.1422 data_time: 0.0513 memory: 1426 loss: 3.1053 loss_ctc: 3.1053
10/07 23:14:18 - mmengine - INFO - Epoch(train) [1][150/391] lr: 1.0000e+00 eta: 3:44:46 time: 0.1379 data_time: 0.0508 memory: 1426 loss: 2.8808 loss_ctc: 2.8808
10/07 23:14:26 - mmengine - INFO - Epoch(train) [1][200/391] lr: 1.0000e+00 eta: 3:37:39 time: 0.1295 data_time: 0.0486 memory: 1426 loss: 2.9587 loss_ctc: 2.9587
10/07 23:14:34 - mmengine - INFO - Epoch(train) [1][250/391] lr: 1.0000e+00 eta: 3:34:55 time: 0.1951 data_time: 0.1082 memory: 1426 loss: 2.7018 loss_ctc: 2.7018
10/07 23:14:41 - mmengine - INFO - Epoch(train) [1][300/391] lr: 1.0000e+00 eta: 3:32:02 time: 0.1376 data_time: 0.0502 memory: 1426 loss: 2.4804 loss_ctc: 2.4804
10/07 23:14:49 - mmengine - INFO - Epoch(train) [1][350/391] lr: 1.0000e+00 eta: 3:30:05 time: 0.1351 data_time: 0.0509 memory: 1426 loss: nan loss_ctc: nan
10/07 23:14:55 - mmengine - INFO - Exp name: crnn_mini-vgg_5e_toy_20231007_231344
10/07 23:14:55 - mmengine - INFO - Saving checkpoint at 1 epochs
10/07 23:15:05 - mmengine - INFO - Epoch(train) [2][ 50/391] lr: 1.0000e+00 eta: 3:31:22 time: 0.1907 data_time: 0.0931 memory: 1426 loss: nan loss_ctc: nan
10/07 23:15:13 - mmengine - INFO - Epoch(train) [2][100/391] lr: 1.0000e+00 eta: 3:29:32 time: 0.1414 data_time: 0.0608 memory: 1426 loss: nan loss_ctc: nan
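
One possible cause of a CTC loss that suddenly turns non-finite in the middle of an epoch (not confirmed for this issue) is a label that is too long for the number of time steps the model outputs. CTC needs at least one step per character plus one extra blank between identical neighbouring characters, and PyTorch's `torch.nn.CTCLoss` documents that infinite losses mainly occur when inputs are too short to be aligned to the targets (its `zero_infinity` argument zeroes them). A minimal sketch of that feasibility check follows; the function names, the labels and the time-step count are hypothetical, and the real number of time steps depends on the backbone and the resized image width.

```python
def ctc_min_input_length(label):
    """Smallest number of time steps CTC needs to emit `label`.

    A blank must separate identical neighbouring characters,
    so each adjacent repeat costs one extra step.
    """
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


def infeasible_labels(labels, num_time_steps):
    """Return the labels that cannot be aligned within `num_time_steps`."""
    return [t for t in labels if ctc_min_input_length(t) > num_time_steps]


# Hypothetical values: 6 output time steps and three made-up labels.
print(ctc_min_input_length("hello"))  # 5 characters + 1 repeat ("ll")
print(infeasible_labels(["cat", "hello", "bookkeeper"], 6))
```

Running such a check over the training labels, with the model's actual output length, would show whether any sample can never be aligned.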

Additional information

No response

@Vegemo-bear

When I train MASTER, NaN shows up right from the start, which is strange.

xReniar commented Mar 13, 2024

@anbo724 Have you solved it?

@SolveProb

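> When I train MASTER, NaN shows up right from the start, which is strange.

I ran into the same thing: the loss is inf at the very beginning, and every loss after that is nan.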
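
The inf-then-nan pattern is consistent with ordinary floating-point behaviour: once a single non-finite value reaches a parameter update, everything computed from it tends to become NaN and stays that way. The sketch below illustrates this with plain Python floats. It is not MMOCR code; `first_non_finite` and the loss values are hypothetical, made up for the example.

```python
import math

inf = float("inf")

# Arithmetic on infinite values readily produces NaN.
print(inf - inf)  # nan
print(inf * 0.0)  # nan

# A toy parameter update: a NaN gradient makes the weight NaN as well.
weight = 0.5
grad = inf - inf              # stand-in for a gradient derived from an inf loss
weight = weight - 1.0 * grad  # learning rate 1.0, chosen only for illustration
print(math.isnan(weight))


def first_non_finite(losses):
    """Return the index of the first inf/NaN loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Made-up loss history: finite values, then inf, then NaN.
print(first_non_finite([3.2, 2.9, inf, float("nan")]))
```

Locating the first non-finite step, rather than the first logged NaN, narrows down which batch introduced the problem.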
