
[Bug] Resuming training after an interruption fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #1538

Helen-Ren-yi opened this issue Apr 30, 2024 · 0 comments
Labels
bug Something isn't working

Prerequisite

Environment

System environment:
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 473525473
GPU 0,1: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.9.1+cu111
PyTorch compiling details: PyTorch built with:

  • GCC 7.3

  • C++ Version: 201402

  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • NNPACK is enabled

  • CPU capability usage: AVX2

  • CUDA Runtime 11.1

  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86

  • CuDNN 8.0.5

  • Magma 2.5.2

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

    TorchVision: 0.10.1+cu111
    OpenCV: 4.9.0
    MMEngine: 0.10.4

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 4}
dist_cfg: {'backend': 'nccl'}
seed: 473525473
Distributed launcher: none
Distributed training: False
GPU number: 1

Reproduces the problem - code sample

04/30 07:45:22 - mmengine - INFO - Config:
custom_hooks = [
dict(interval=1, type='BasicVisualizationHook'),
]
dataset_type = 'BasicImageDataset'
default_hooks = dict(
checkpoint=dict(
by_epoch=False,
interval=5000,
max_keep_ckpts=10,
out_dir='./work_dirs',
rule=[
'less',
'greater',
'greater',
],
save_best=[
'MAE',
'PSNR',
'SSIM',
],
save_optimizer=True,
type='CheckpointHook'),
logger=dict(interval=100, type='LoggerHook'),
param_scheduler=dict(type='ParamSchedulerHook'),
sampler_seed=dict(type='DistSamplerSeedHook'),
timer=dict(type='IterTimerHook'))
default_scope = 'mmagic'
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=4))
experiment_name = 'glean_x8_2xb8_cat'
inference_pipeline = [
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
backend='pillow',
interpolation='bicubic',
keys=[
'img',
],
scale=(
32,
32,
),
type='Resize'),
dict(type='PackInputs'),
]
launcher = 'none'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False, type='LogProcessor', window_size=100)
model = dict(
data_preprocessor=dict(
mean=[
127.5,
127.5,
127.5,
],
std=[
127.5,
127.5,
127.5,
],
type='DataPreprocessor'),
discriminator=dict(
in_size=256,
init_cfg=dict(
checkpoint=
'http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth',
prefix='discriminator',
type='Pretrained'),
type='StyleGANv2Discriminator'),
gan_loss=dict(
fake_label_val=0,
gan_type='vanilla',
loss_weight=0.01,
real_label_val=1.0,
type='GANLoss'),
generator=dict(
in_size=32,
init_cfg=dict(
checkpoint=
'http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth',
prefix='generator_ema',
type='Pretrained'),
out_size=256,
style_channels=512,
type='GLEANStyleGANv2'),
perceptual_loss=dict(
criterion='mse',
layer_weights=dict({'21': 1.0}),
norm_img=False,
perceptual_weight=0.01,
pretrained='torchvision://vgg16',
style_weight=0,
type='PerceptualLoss',
vgg_type='vgg16'),
pixel_loss=dict(loss_weight=1.0, reduction='mean', type='MSELoss'),
test_cfg=dict(),
train_cfg=dict(),
type='SRGAN')
model_wrapper_cfg = dict(
find_unused_parameters=True, type='MMSeparateDistributedDataParallel')
optim_wrapper = dict(
constructor='MultiOptimWrapperConstructor',
discriminator=dict(
optimizer=dict(betas=(
0.9,
0.99,
), lr=0.0001, type='Adam'),
type='OptimWrapper'),
generator=dict(
optimizer=dict(betas=(
0.9,
0.99,
), lr=0.0001, type='Adam'),
type='OptimWrapper'))
param_scheduler = dict(
T_max=600000, by_epoch=False, eta_min=1e-07, type='CosineAnnealingLR')
resume = True
save_dir = './work_dirs'
scale = 8
test_cfg = dict(type='MultiTestLoop')
test_dataloader = dict(
dataset=dict(
ann_file='meta_info_Cat100_GT.txt',
data_prefix=dict(gt='GT', img='BIx8_down'),
data_root='data/cat_test',
metainfo=dict(dataset_type='cat', task_name='sisr'),
pipeline=[
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(type='PackInputs'),
],
type='BasicImageDataset'),
drop_last=False,
num_workers=8,
persistent_workers=False,
pin_memory=True,
sampler=dict(shuffle=False, type='DefaultSampler'))
test_evaluator = [
dict(type='MAE'),
dict(type='PSNR'),
dict(type='SSIM'),
]
test_pipeline = [
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(type='PackInputs'),
]
train_cfg = dict(
max_iters=300000, type='IterBasedTrainLoop', val_interval=5000)
train_dataloader = dict(
batch_size=8,
dataset=dict(
ann_file='meta_info_LSUNcat_GT.txt',
data_prefix=dict(gt='GT', img='BIx8_down'),
data_root='data/cat_train',
metainfo=dict(dataset_type='cat', task_name='sisr'),
pipeline=[
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(
direction='horizontal',
flip_ratio=0.5,
keys=[
'img',
'gt',
],
type='Flip'),
dict(type='PackInputs'),
],
type='BasicImageDataset'),
num_workers=8,
persistent_workers=False,
pin_memory=True,
sampler=dict(shuffle=True, type='InfiniteSampler'))
train_pipeline = [
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(
direction='horizontal',
flip_ratio=0.5,
keys=[
'img',
'gt',
],
type='Flip'),
dict(type='PackInputs'),
]
val_cfg = dict(type='MultiValLoop')
val_dataloader = dict(
dataset=dict(
ann_file='meta_info_Cat100_GT.txt',
data_prefix=dict(gt='GT', img='BIx8_down'),
data_root='data/cat_test',
metainfo=dict(dataset_type='cat', task_name='sisr'),
pipeline=[
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(type='PackInputs'),
],
type='BasicImageDataset'),
drop_last=False,
num_workers=8,
persistent_workers=False,
pin_memory=True,
sampler=dict(shuffle=False, type='DefaultSampler'))
val_evaluator = [
dict(type='MAE'),
dict(type='PSNR'),
dict(type='SSIM'),
]
vis_backends = [
dict(type='LocalVisBackend'),
]
visualizer = dict(
bgr2rgb=True,
fn_key='gt_path',
img_keys=[
'gt_img',
'input',
'pred_img',
],
type='ConcatImageVisualizer',
vis_backends=[
dict(type='LocalVisBackend'),
])
work_dir = './work_dirs/glean_x8_2xb8_cat'

04/30 07:45:32 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
04/30 07:45:32 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook

before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(VERY_LOW ) CheckpointHook

before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook

before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook

after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) BasicVisualizationHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

before_val:
(VERY_HIGH ) RuntimeInfoHook

before_val_epoch:
(NORMAL ) IterTimerHook

before_val_iter:
(NORMAL ) IterTimerHook

after_val_iter:
(NORMAL ) IterTimerHook
(NORMAL ) BasicVisualizationHook
(BELOW_NORMAL) LoggerHook

after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

after_val:
(VERY_HIGH ) RuntimeInfoHook

after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook

before_test:
(VERY_HIGH ) RuntimeInfoHook

before_test_epoch:
(NORMAL ) IterTimerHook

before_test_iter:
(NORMAL ) IterTimerHook

after_test_iter:
(NORMAL ) IterTimerHook
(NORMAL ) BasicVisualizationHook
(BELOW_NORMAL) LoggerHook

after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_test:
(VERY_HIGH ) RuntimeInfoHook

after_run:
(BELOW_NORMAL) LoggerHook

04/30 07:45:33 - mmengine - INFO - Working directory: ./work_dirs/glean_x8_2xb8_cat
04/30 07:45:33 - mmengine - INFO - Log directory: /root/glean/work_dirs/glean_x8_2xb8_cat/20240430_074521
04/30 07:45:33 - mmengine - WARNING - cat is not a meta file, simply parsed as meta information
04/30 07:45:33 - mmengine - WARNING - sisr is not a meta file, simply parsed as meta information
04/30 07:45:35 - mmengine - INFO - Add to optimizer 'generator' ({'type': 'Adam', 'lr': 0.0001, 'betas': (0.9, 0.99)}): 'generator'.
04/30 07:45:35 - mmengine - INFO - Add to optimizer 'discriminator' ({'type': 'Adam', 'lr': 0.0001, 'betas': (0.9, 0.99)}): 'discriminator'.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class MAE.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class PSNR.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class SSIM.
04/30 07:45:36 - mmengine - INFO - load generator_ema in model from: http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth
Loads checkpoint by http backend from path: http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth
04/30 07:45:36 - mmengine - WARNING - The model and loaded state dict do not match exactly

Reproduces the problem - command or script

python tools/train.py configs/glean/glean_x8_2xb8_cat.py --resume

Reproduces the problem - error message

04/30 06:43:01 - mmengine - INFO - Saving checkpoint at 275000 iterations
Switch to evaluation style mode: single
04/30 06:43:25 - mmengine - INFO - Iter(val) [100/100] eta: 0:00:00 time: 0.1712 data_time: 0.0235 memory: 3032
04/30 06:43:26 - mmengine - INFO - Iter(val) [100/100] MAE: 0.0457 PSNR: 23.7792 SSIM: 0.5953 data_time: 0.0234 time: 0.1709
Traceback (most recent call last):
File "tools/train.py", line 114, in <module>
main()
File "tools/train.py", line 107, in main
runner.train()
File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1778, in train
model = self.train_loop.run() # type: ignore
File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/loops.py", line 294, in run
self.runner.val_loop.run()
File "/root/glean/mmagic/engine/runner/multi_loops.py", line 247, in run
self._runner.call_hook('after_val_epoch', metrics=multi_metric)
File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1841, in call_hook
getattr(hook, fn_name)(self, **kwargs)
File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 361, in after_val_epoch
self._save_best_checkpoint(runner, metrics)
File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 521, in _save_best_checkpoint
if key_score is None or not self.is_better_than[key_indicator](
File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 123, in <lambda>
rule_map = {'greater': lambda x, y: x > y, 'less': lambda x, y: x < y}
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
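The traceback ends in CheckpointHook's best-score comparison: the freshly computed validation metric is a CUDA tensor, while the best score restored from the resumed checkpoint appears to live on the CPU, so the `x > y` lambda mixes devices. A minimal sketch of a device-safe comparison, using a hypothetical `as_float` helper that normalizes tensor-or-number metrics to plain Python floats (duck-typed here so it needs no GPU; a real `torch.Tensor`, CPU or CUDA, exposes the same scalar `.item()` method):

```python
# Mirror of the comparator map in mmengine's CheckpointHook.
rule_map = {'greater': lambda x, y: x > y, 'less': lambda x, y: x < y}

def as_float(value):
    # Hypothetical normalizer: scalar tensors expose .item(), which
    # returns a device-free Python number; plain numbers pass through.
    item = getattr(value, "item", None)
    return float(item()) if callable(item) else float(value)

def is_better(rule, new_score, best_score):
    # Comparing two Python floats can never raise the cross-device error.
    return rule_map[rule](as_float(new_score), as_float(best_score))
```

This is only an illustration of the failure mode and one defensive fix, not the actual mmengine code path.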

Additional information

This happens after I add `--resume` to the command line. After resuming, training runs for 5000 iterations, the model automatically saves a checkpoint and runs validation; then, when it is about to enter the next 5000-iteration cycle, it raises the error above and training cannot continue automatically. The full error message is shown above.
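One possible workaround until this is resolved upstream: rewrite the saved checkpoint so any cached metric tensors become plain floats before resuming. The helper below is a hedged sketch; the `message_hub` key and the checkpoint filename in the commented usage are assumptions based on how mmengine stores runtime info, not verified against this exact version:

```python
def tensors_to_floats(obj):
    # Recursively replace scalar tensor-like leaves (anything exposing
    # .item(), e.g. a 0-d torch.Tensor) with plain Python numbers, so no
    # cached best score in the checkpoint stays pinned to cuda:0.
    if isinstance(obj, dict):
        return {k: tensors_to_floats(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(tensors_to_floats(v) for v in obj)
    item = getattr(obj, "item", None)
    return item() if callable(item) else obj

# Assumed usage (not run here; requires torch and the real checkpoint path):
#   import torch
#   ckpt = torch.load('work_dirs/glean_x8_2xb8_cat/iter_275000.pth',
#                     map_location='cpu')
#   ckpt['message_hub'] = tensors_to_floats(ckpt.get('message_hub', {}))
#   torch.save(ckpt, 'work_dirs/glean_x8_2xb8_cat/iter_275000_cpu.pth')
```

Only the runtime-info/message-hub portion should be converted this way; model and optimizer state dicts must be left as tensors.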

@Helen-Ren-yi Helen-Ren-yi added the bug Something isn't working label Apr 30, 2024