About the loss of MoCoV3: not decreasing? #1850

Open
RobinHan24 opened this issue Dec 20, 2023 · 0 comments
Branch

main branch (mmpretrain version)

Describe the bug

I trained MoCoV3 with a ResNet-50 backbone on my own dataset, but the loss only dropped from 27 to about 23 and has stayed around 23 ever since. Why is it not decreasing any further?

12/20 09:34:59 - mmengine - INFO - Saving checkpoint at 3562 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:11 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:11 - mmengine - INFO - Epoch(train) [3563][3/3] lr: 2.4223e+00 eta: 0:56:54 time: 2.4053 data_time: 1.6347 memory: 18037 loss: 23.5969
12/20 09:35:11 - mmengine - INFO - Saving checkpoint at 3563 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:22 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:22 - mmengine - INFO - Epoch(train) [3564][3/3] lr: 2.4127e+00 eta: 0:56:47 time: 2.3942 data_time: 1.6182 memory: 18037 loss: 23.5970
12/20 09:35:22 - mmengine - INFO - Saving checkpoint at 3564 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:33 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:33 - mmengine - INFO - Epoch(train) [3565][3/3] lr: 2.4032e+00 eta: 0:56:39 time: 2.3664 data_time: 1.5931 memory: 18037 loss: 23.5975
12/20 09:35:33 - mmengine - INFO - Saving checkpoint at 3565 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:44 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:44 - mmengine - INFO - Epoch(train) [3566][3/3] lr: 2.3936e+00 eta: 0:56:31 time: 2.3551 data_time: 1.5844 memory: 18037 loss: 23.5980
12/20 09:35:44 - mmengine - INFO - Saving checkpoint at 3566 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:35:56 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:56 - mmengine - INFO - Epoch(train) [3567][3/3] lr: 2.3841e+00 eta: 0:56:23 time: 2.3336 data_time: 1.5614 memory: 18037 loss: 23.5950
12/20 09:35:56 - mmengine - INFO - Saving checkpoint at 3567 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:07 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:07 - mmengine - INFO - Epoch(train) [3568][3/3] lr: 2.3745e+00 eta: 0:56:15 time: 2.4241 data_time: 1.6476 memory: 18037 loss: 23.5938
12/20 09:36:07 - mmengine - INFO - Saving checkpoint at 3568 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:18 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:18 - mmengine - INFO - Epoch(train) [3569][3/3] lr: 2.3650e+00 eta: 0:56:08 time: 2.4099 data_time: 1.6249 memory: 18037 loss: 23.5949
12/20 09:36:18 - mmengine - INFO - Saving checkpoint at 3569 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:29 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:29 - mmengine - INFO - Epoch(train) [3570][3/3] lr: 2.3555e+00 eta: 0:56:00 time: 2.3381 data_time: 1.5578 memory: 18037 loss: 23.5944
12/20 09:36:29 - mmengine - INFO - Saving checkpoint at 3570 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:39 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:39 - mmengine - INFO - Epoch(train) [3571][3/3] lr: 2.3459e+00 eta: 0:55:52 time: 2.1778 data_time: 1.4037 memory: 18037 loss: 23.5939
12/20 09:36:39 - mmengine - INFO - Saving checkpoint at 3571 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:36:50 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:50 - mmengine - INFO - Epoch(train) [3572][3/3] lr: 2.3364e+00 eta: 0:55:44 time: 2.2672 data_time: 1.4953 memory: 18037 loss: 23.5931
12/20 09:36:50 - mmengine - INFO - Saving checkpoint at 3572 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
12/20 09:37:02 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:37:02 - mmengine - INFO - Epoch(train) [3573][3/3] lr: 2.3268e+00 eta: 0:55:36 time: 2.3623 data_time: 1.5885 memory: 18037 loss: 23.5929
12/20 09:37:02 - mmengine - INFO - Saving checkpoint at 3573 epochs

Environment information

{'sys.platform': 'linux',
'Python': '3.9.0 (default, Nov 15 2020, 14:28:56) [GCC 7.3.0]',
'CUDA available': True,
'numpy_random_seed': 2147483648,
'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A10',
'CUDA_HOME': '/usr/local/cuda-11.7',
'NVCC': 'Cuda compilation tools, release 11.7, V11.7.99',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
'PyTorch': '1.13.1+cu117',
'TorchVision': '0.14.1+cu117',
'OpenCV': '4.8.0',
'MMEngine': '0.7.3',
'MMCV': '2.0.0',
'MMPreTrain': '1.0.0rc7+e80418a'}

Other information

My config file:
mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
_base_ = [
    'imagenet_bs512_mocov3.py',
    'default_runtime.py',
]

# model settings
temperature = 1.0
model = dict(
    type='MoCoV3',
    base_momentum=0.004,  # 0.01 for 100e and 300e, 0.004 for 800e and 1000e
    backbone=dict(
        type='ResNet',
        depth=50,
        norm_cfg=dict(type='SyncBN'),
        zero_init_residual=False),
    neck=dict(
        type='NonLinearNeck',
        in_channels=2048,
        hid_channels=4096,
        out_channels=256,
        num_layers=2,
        with_bias=False,
        with_last_bn=True,
        with_last_bn_affine=False,
        with_last_bias=False,
        with_avg_pool=True),
    head=dict(
        type='MoCoV3Head',
        predictor=dict(
            type='NonLinearNeck',
            in_channels=256,
            hid_channels=4096,
            out_channels=256,
            num_layers=2,
            with_bias=False,
            with_last_bn=False,
            with_last_bn_affine=False,
            with_last_bias=False,
            with_avg_pool=False),
        loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
        temperature=temperature))
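
For reference, the MoCoV3Head above computes an InfoNCE-style contrastive loss between the online predictor output and the momentum-encoder keys of the other view. A minimal plain-PyTorch sketch of that loss (not mmpretrain's actual implementation, which also all-gathers the keys across GPUs) looks roughly like this:

import torch
import torch.nn.functional as F

def mocov3_ctr(q, k, temperature=1.0):
    """One direction of the symmetric MoCoV3 contrastive loss.

    q: predictor output of the online branch, shape (N, C)
    k: momentum-encoder output of the other view, shape (N, C)
    The diagonal pairs are positives; every other key in the batch is a negative.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                  # (N, N) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    # the 2 * temperature factor corresponds to loss_weight=2 * temperature above;
    # it keeps the gradient scale roughly independent of the temperature
    return 2 * temperature * F.cross_entropy(logits, labels)

# full loss, symmetrized over the two augmented views:
# loss = mocov3_ctr(q1, k2, temperature) + mocov3_ctr(q2, k1, temperature)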

# optimizer
optim_wrapper = dict(
    type='AmpOptimWrapper',
    loss_scale='dynamic',
    optimizer=dict(type='LARS', lr=4.8, weight_decay=1.5e-6, momentum=0.9),
    paramwise_cfg=dict(
        custom_keys={
            'bn': dict(decay_mult=0, lars_exclude=True),
            'bias': dict(decay_mult=0, lars_exclude=True),
            # bn layer in ResNet block downsample module
            'downsample.1': dict(decay_mult=0, lars_exclude=True),
        }),
)
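
The paramwise_cfg above asks the optimizer constructor to exempt BatchNorm parameters, biases, and the ResNet downsample BN from weight decay and LARS adaptation. A rough plain-PyTorch sketch of that grouping (only an approximation; mmengine's exact key-matching rules may differ) would be:

import torch.nn as nn

def lars_param_groups(model: nn.Module, weight_decay=1.5e-6):
    """Split parameters the way custom_keys above requests: BN weights, biases
    and the downsample BN get no weight decay and are marked for exclusion
    from LARS trust-ratio scaling."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if 'bn' in name or name.endswith('.bias') or 'downsample.1' in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        dict(params=decay, weight_decay=weight_decay),
        dict(params=no_decay, weight_decay=0.0, lars_exclude=True),
    ]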

# learning rate scheduler
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-4,
        by_epoch=True,
        begin=0,
        end=10,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=790,
        by_epoch=True,
        begin=10,
        end=4000,
        convert_to_iter_based=True)
]
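
One thing worth double-checking here: the cosine scheduler keeps T_max=790 from the original 800-epoch recipe, while begin/end and max_epochs below were raised to 4000, so the first cosine period only covers roughly the first 800 epochs of the run. An epoch-level sketch of the intended warmup-plus-cosine curve (a simplification; the real schedulers step per iteration because of convert_to_iter_based=True, and CosineAnnealingLR uses a recursive update) is:

import math

def approx_lr(epoch, base_lr=4.8, warmup_epochs=10, t_max=790, eta_min=0.0):
    """Closed-form approximation of LinearLR(start_factor=1e-4) warmup followed
    by one cosine period of length t_max, evaluated at epoch granularity."""
    if epoch < warmup_epochs:
        start = 1e-4 * base_lr
        return start + (base_lr - start) * epoch / warmup_epochs
    t = min(epoch - warmup_epochs, t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))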

# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=4000)

# only keep the latest 3 checkpoints
default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))

# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=4096)

imagenet_bs512_mocov3.py:

# dataset settings
dataset_type = 'CustomDataset'
data_root = 'data/yf5class_old/'
data_preprocessor = dict(
    type='SelfSupDataPreprocessor',
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    to_rgb=True)

view_pipeline1 = [
    dict(
        type='RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.),
        backend='pillow'),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=1.),
    dict(type='Solarize', thr=128, prob=0.),
    dict(type='RandomFlip', prob=0.5),
]
view_pipeline2 = [
    dict(
        type='RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.),
        backend='pillow'),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=0.1),
    dict(type='Solarize', thr=128, prob=0.2),
    dict(type='RandomFlip', prob=0.5),
]
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiView',
        num_views=[1, 1],
        transforms=[view_pipeline1, view_pipeline2]),
    dict(type='PackInputs')
]
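
For readers who want to inspect the two views outside mmpretrain, a hypothetical torchvision approximation of view_pipeline2 above (view_pipeline1 differs only in GaussianBlur prob=1.0 and Solarize prob=0.0; the blur kernel size is an assumption, since the config only fixes the sigma range) could look like:

from torchvision import transforms as T

view2 = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    # kernel_size=23 is an assumed value; the config only specifies sigma in (0.1, 2.0)
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.1),
    T.RandomSolarize(threshold=128, p=0.2),
    T.RandomHorizontalFlip(p=0.5),
])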

train_dataloader = dict(
    batch_size=192,
    num_workers=8,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        type='CustomDataset',
        data_root=data_root,
        ann_file='',  # we assume the sub-folder format, so the annotation file is left empty
        data_prefix='train',
        pipeline=train_pipeline))
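
As a quick sanity check on the scale of this run (assuming batch_size is per GPU and that the linear-scaling rule behind auto_scale_lr is only applied when explicitly enabled at launch):

# effective batch size and linearly scaled LR for this setup
num_gpus = 8                                   # from the environment info above
per_gpu_batch = 192                            # train_dataloader batch_size
effective_batch = num_gpus * per_gpu_batch     # 1536, well below the 8 x 512 = 4096 the config name implies
base_lr = 4.8                                  # tuned for base_batch_size=4096
scaled_lr = base_lr * effective_batch / 4096   # 1.8 if the scaling rule is applied
print(effective_batch, scaled_lr)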
