Unable to resume training #3634

sparshgarg23 · 2024-04-12T18:38:14Z

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug
A clear and concise description of what the bug is.

Reproduction

1. What command or script did you run?

!python tools/train.py /content/mmsegmentation/configs/segformer/segformer_mit-b3_8xb2-160k_ade20k-512x512.py --resume /content/mmsegmentation/checkpoints/iter_48000.pth


2. Did you make any modifications on the code or config? Did you understand what you have modified?

no

3. What dataset did you use?

ADE20K

**Environment**
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.1.1+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.1+cu121
OpenCV: 4.8.0
MMEngine: 0.10.3
MMSegmentation: 1.2.2+b040e14


**Error traceback**

If applicable, paste the error traceback here.

usage: train.py [-h] [--work-dir WORK_DIR] [--resume] [--amp]
[--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]]
[--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
config
train.py: error: unrecognized arguments: /content/mmsegmentation/checkpoints/iter_48000.pth


sparshgarg23 · 2024-04-12T18:54:22Z

Not sure why it's giving me an unrecognized argument error. I tried resuming with DAB-DETR in mmdetection and with mmdetection3d, and was able to resume training in both.

I know you guys are busy but would appreciate some insights into this.

Zoulinx · 2024-04-18T01:52:30Z

> Not sure why it's giving me an unrecognized argument error. I tried resuming with DAB-DETR in mmdetection and with mmdetection3d, and was able to resume training in both.
>
> I know you guys are busy but would appreciate some insights into this.

The `--resume` option is a boolean flag: it indicates whether to resume training based on the records in the work directory. It does not accept the path of a checkpoint as its value.
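
The behavior can be reproduced with a small stand-in parser. This is only a sketch: `build_parser` is a hypothetical helper that mirrors the `config`, `--resume` and `--cfg-options` entries from the usage message above, not the actual code in `tools/train.py` (the real script parses `--cfg-options` into a dict, while this sketch keeps the raw strings). The second call assumes that the checkpoint path can be supplied through the config's `load_from` field, which is how MMEngine's documentation describes resuming from a specific checkpoint.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, simplified stand-in for the parser in tools/train.py.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("config")
    # A store_true flag consumes no value: it is either present or absent.
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--cfg-options", nargs="+", default=[])
    return parser


parser = build_parser()

# A path placed after --resume is not consumed by the flag, so it is left
# over as an unrecognized argument (parse_args would exit with an error).
bad_args, bad_extra = parser.parse_known_args(
    ["cfg.py", "--resume", "checkpoints/iter_48000.pth"]
)
print(bad_args.resume, bad_extra)

# Assumed alternative: keep --resume bare and pass the path as a config override.
good_args, good_extra = parser.parse_known_args(
    ["cfg.py", "--resume", "--cfg-options", "load_from=checkpoints/iter_48000.pth"]
)
print(good_args.resume, good_extra, good_args.cfg_options)
```

With the real script, the equivalent invocation would presumably be `python tools/train.py <config> --resume --cfg-options load_from=<checkpoint>`; a bare `--resume` instead relies on the latest checkpoint recorded in the work directory.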

sparshgarg23 · 2024-04-18T02:37:41Z

Thanks for replying. In order to resume, I am assuming that I need the work directory with all of its contents, such as the log file and scalars.json, as well as the previous checkpoint?
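
One way to check whether a work directory is ready for a bare `--resume` is sketched below. It assumes, based on how MMEngine's checkpoint hook appears to behave, that the work directory contains a plain-text `last_checkpoint` file holding the path of the most recent checkpoint; the helper name `latest_checkpoint` is hypothetical and not part of MMEngine. Under that assumption, the log files and scalars.json would not be needed for resuming, but this has not been confirmed in this thread.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(work_dir: str) -> Optional[str]:
    """Return the checkpoint path recorded in work_dir, or None if absent.

    Assumption: the training run wrote a `last_checkpoint` text file into
    the work directory that contains the path of its newest checkpoint.
    """
    marker = Path(work_dir) / "last_checkpoint"
    if not marker.is_file():
        return None
    # An empty marker file is treated the same as a missing one.
    return marker.read_text().strip() or None
```

If the helper returns `None` for the work directory in use, automatic resume likely has nothing to pick up, and pointing `load_from` at the checkpoint explicitly may be the safer route.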
