About the loss of MoCoV3: not decreasing? #1850

Open
RobinHan24 opened this issue Dec 20, 2023 · 0 comments
Branch

main branch (mmpretrain version)

Describe the bug

I trained MoCoV3 with a ResNet-50 backbone on my own dataset, but the loss only dropped from 27 to about 23 and has stayed around 23 ever since. Why is it not decreasing any further?

12/20 09:34:59 - mmengine - INFO - Saving checkpoint at 3562 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:11 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:11 - mmengine - INFO - Epoch(train) [3563][3/3] lr: 2.4223e+00 eta: 0:56:54 time: 2.4053 data_time: 1.6347 memory: 18037 loss: 23.5969
12/20 09:35:11 - mmengine - INFO - Saving checkpoint at 3563 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:22 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:22 - mmengine - INFO - Epoch(train) [3564][3/3] lr: 2.4127e+00 eta: 0:56:47 time: 2.3942 data_time: 1.6182 memory: 18037 loss: 23.5970
12/20 09:35:22 - mmengine - INFO - Saving checkpoint at 3564 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:33 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:33 - mmengine - INFO - Epoch(train) [3565][3/3] lr: 2.4032e+00 eta: 0:56:39 time: 2.3664 data_time: 1.5931 memory: 18037 loss: 23.5975
12/20 09:35:33 - mmengine - INFO - Saving checkpoint at 3565 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:44 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:44 - mmengine - INFO - Epoch(train) [3566][3/3] lr: 2.3936e+00 eta: 0:56:31 time: 2.3551 data_time: 1.5844 memory: 18037 loss: 23.5980
12/20 09:35:44 - mmengine - INFO - Saving checkpoint at 3566 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:56 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:56 - mmengine - INFO - Epoch(train) [3567][3/3] lr: 2.3841e+00 eta: 0:56:23 time: 2.3336 data_time: 1.5614 memory: 18037 loss: 23.5950
12/20 09:35:56 - mmengine - INFO - Saving checkpoint at 3567 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:07 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:07 - mmengine - INFO - Epoch(train) [3568][3/3] lr: 2.3745e+00 eta: 0:56:15 time: 2.4241 data_time: 1.6476 memory: 18037 loss: 23.5938
12/20 09:36:07 - mmengine - INFO - Saving checkpoint at 3568 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:18 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:18 - mmengine - INFO - Epoch(train) [3569][3/3] lr: 2.3650e+00 eta: 0:56:08 time: 2.4099 data_time: 1.6249 memory: 18037 loss: 23.5949
12/20 09:36:18 - mmengine - INFO - Saving checkpoint at 3569 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:29 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:29 - mmengine - INFO - Epoch(train) [3570][3/3] lr: 2.3555e+00 eta: 0:56:00 time: 2.3381 data_time: 1.5578 memory: 18037 loss: 23.5944
12/20 09:36:29 - mmengine - INFO - Saving checkpoint at 3570 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:39 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:39 - mmengine - INFO - Epoch(train) [3571][3/3] lr: 2.3459e+00 eta: 0:55:52 time: 2.1778 data_time: 1.4037 memory: 18037 loss: 23.5939
12/20 09:36:39 - mmengine - INFO - Saving checkpoint at 3571 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:50 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:50 - mmengine - INFO - Epoch(train) [3572][3/3] lr: 2.3364e+00 eta: 0:55:44 time: 2.2672 data_time: 1.4953 memory: 18037 loss: 23.5931
12/20 09:36:50 - mmengine - INFO - Saving checkpoint at 3572 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:37:02 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:37:02 - mmengine - INFO - Epoch(train) [3573][3/3] lr: 2.3268e+00 eta: 0:55:36 time: 2.3623 data_time: 1.5885 memory: 18037 loss: 23.5929
12/20 09:37:02 - mmengine - INFO - Saving checkpoint at 3573 epochs

Environment information

{'sys.platform': 'linux',
'Python': '3.9.0 (default, Nov 15 2020, 14:28:56) [GCC 7.3.0]',
'CUDA available': True,
'numpy_random_seed': 2147483648,
'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A10',
'CUDA_HOME': '/usr/local/cuda-11.7',
'NVCC': 'Cuda compilation tools, release 11.7, V11.7.99',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
'PyTorch': '1.13.1+cu117',
'TorchVision': '0.14.1+cu117',
'OpenCV': '4.8.0',
'MMEngine': '0.7.3',
'MMCV': '2.0.0',
'MMPreTrain': '1.0.0rc7+e80418a'}

Other information

My config file:
mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
_base_ = [
    'imagenet_bs512_mocov3.py',
    'default_runtime.py',
]

# model settings
temperature = 1.0
model = dict(
    type='MoCoV3',
    base_momentum=0.004,  # 0.01 for 100e and 300e, 0.004 for 800e and 1000e
    backbone=dict(
        type='ResNet',
        depth=50,
        norm_cfg=dict(type='SyncBN'),
        zero_init_residual=False),
    neck=dict(
        type='NonLinearNeck',
        in_channels=2048,
        hid_channels=4096,
        out_channels=256,
        num_layers=2,
        with_bias=False,
        with_last_bn=True,
        with_last_bn_affine=False,
        with_last_bias=False,
        with_avg_pool=True),
    head=dict(
        type='MoCoV3Head',
        predictor=dict(
            type='NonLinearNeck',
            in_channels=256,
            hid_channels=4096,
            out_channels=256,
            num_layers=2,
            with_bias=False,
            with_last_bn=False,
            with_last_bn_affine=False,
            with_last_bias=False,
            with_avg_pool=False),
        loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
        temperature=temperature))
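
For reference, the MoCoV3Head above computes an InfoNCE-style contrastive loss between the online predictor output and the momentum-encoder keys of the other view. A minimal plain-PyTorch sketch of that loss (not mmpretrain's actual implementation, which also all-gathers the keys across GPUs) looks roughly like this:

import torch
import torch.nn.functional as F

def mocov3_ctr(q, k, temperature=1.0):
    """One direction of the symmetric MoCoV3 contrastive loss.

    q: predictor output of the online branch, shape (N, C)
    k: momentum-encoder output of the other view, shape (N, C)
    The diagonal pairs are positives; every other key in the batch is a negative.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                  # (N, N) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    # the 2 * temperature factor corresponds to loss_weight=2 * temperature above;
    # it keeps the gradient scale roughly independent of the temperature
    return 2 * temperature * F.cross_entropy(logits, labels)

# full loss, symmetrized over the two augmented views:
# loss = mocov3_ctr(q1, k2, temperature) + mocov3_ctr(q2, k1, temperature)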

# optimizer
optim_wrapper = dict(
    type='AmpOptimWrapper',
    loss_scale='dynamic',
    optimizer=dict(type='LARS', lr=4.8, weight_decay=1.5e-6, momentum=0.9),
    paramwise_cfg=dict(
        custom_keys={
            'bn': dict(decay_mult=0, lars_exclude=True),
            'bias': dict(decay_mult=0, lars_exclude=True),
            # bn layer in ResNet block downsample module
            'downsample.1': dict(decay_mult=0, lars_exclude=True),
        }),
)
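
The paramwise_cfg above asks the optimizer constructor to exempt BatchNorm parameters, biases, and the ResNet downsample BN from weight decay and LARS adaptation. A rough plain-PyTorch sketch of that grouping (only an approximation; mmengine's exact key-matching rules may differ) would be:

import torch.nn as nn

def lars_param_groups(model: nn.Module, weight_decay=1.5e-6):
    """Split parameters the way custom_keys above requests: BN weights, biases
    and the downsample BN get no weight decay and are marked for exclusion
    from LARS trust-ratio scaling."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if 'bn' in name or name.endswith('.bias') or 'downsample.1' in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        dict(params=decay, weight_decay=weight_decay),
        dict(params=no_decay, weight_decay=0.0, lars_exclude=True),
    ]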

# learning rate scheduler
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-4,
        by_epoch=True,
        begin=0,
        end=10,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=790,
        by_epoch=True,
        begin=10,
        end=4000,
        convert_to_iter_based=True)
]
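
One thing worth double-checking here: the cosine scheduler keeps T_max=790 from the original 800-epoch recipe, while begin/end and max_epochs below were raised to 4000, so the first cosine period only covers roughly the first 800 epochs of the run. An epoch-level sketch of the intended warmup-plus-cosine curve (a simplification; the real schedulers step per iteration because of convert_to_iter_based=True, and CosineAnnealingLR uses a recursive update) is:

import math

def approx_lr(epoch, base_lr=4.8, warmup_epochs=10, t_max=790, eta_min=0.0):
    """Closed-form approximation of LinearLR(start_factor=1e-4) warmup followed
    by one cosine period of length t_max, evaluated at epoch granularity."""
    if epoch < warmup_epochs:
        start = 1e-4 * base_lr
        return start + (base_lr - start) * epoch / warmup_epochs
    t = min(epoch - warmup_epochs, t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))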

# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=4000)

# only keep the latest 3 checkpoints
default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))

# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=4096)

imagenet_bs512_mocov3.py:

# dataset settings
dataset_type = 'CustomDataset'
data_root = 'data/yf5class_old/'
data_preprocessor = dict(
    type='SelfSupDataPreprocessor',
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    to_rgb=True)

view_pipeline1 = [
    dict(
        type='RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.),
        backend='pillow'),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=1.),
    dict(type='Solarize', thr=128, prob=0.),
    dict(type='RandomFlip', prob=0.5),
]
view_pipeline2 = [
    dict(
        type='RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.),
        backend='pillow'),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=0.1),
    dict(type='Solarize', thr=128, prob=0.2),
    dict(type='RandomFlip', prob=0.5),
]
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiView',
        num_views=[1, 1],
        transforms=[view_pipeline1, view_pipeline2]),
    dict(type='PackInputs')
]
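
For readers who want to inspect the two views outside mmpretrain, a hypothetical torchvision approximation of view_pipeline2 above (view_pipeline1 differs only in GaussianBlur prob=1.0 and Solarize prob=0.0; the blur kernel size is an assumption, since the config only fixes the sigma range) could look like:

from torchvision import transforms as T

view2 = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    # kernel_size=23 is an assumed value; the config only specifies sigma in (0.1, 2.0)
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.1),
    T.RandomSolarize(threshold=128, p=0.2),
    T.RandomHorizontalFlip(p=0.5),
])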

train_dataloader = dict(
    batch_size=192,
    num_workers=8,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        type='CustomDataset',
        data_root=data_root,
        ann_file='',  # we assume the sub-folder format, so the annotation file is left empty
        data_prefix='train',
        pipeline=train_pipeline))
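
As a quick sanity check on the scale of this run (assuming batch_size is per GPU and that the linear-scaling rule behind auto_scale_lr is only applied when explicitly enabled at launch):

# effective batch size and linearly scaled LR for this setup
num_gpus = 8                                   # from the environment info above
per_gpu_batch = 192                            # train_dataloader batch_size
effective_batch = num_gpus * per_gpu_batch     # 1536, well below the 8 x 512 = 4096 the config name implies
base_lr = 4.8                                  # tuned for base_batch_size=4096
scaled_lr = base_lr * effective_batch / 4096   # 1.8 if the scaling rule is applied
print(effective_batch, scaled_lr)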
