This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Linear Image classification weights not loading properly: vissl pretrained models simclr or Dcv2 #566

Open
DC95 opened this issue Sep 8, 2022 · 3 comments

Comments

@DC95

DC95 commented Sep 8, 2022

Hello Vissl team, @QuentinDuval

Sorry for the trouble.

I have been trying to train a linear classifier on custom data using VISSL pre-trained models (SimCLR and dcv2). I read #545 and #550 and tried many things, but could not find a solution.

Coming to the main point: linear classification training with either the dcv2 or the SimCLR VISSL pre-trained model fails with the same error:

vissl/trainer/train_task.py", line 742, in _update_classy_state assert success, "Update classy state from checkpoint failed."
AssertionError: Update classy state from checkpoint failed.

dcv2 log file - log_dcv2.txt

Out of curiosity, I also tried the SimCLR pre-trained model and got the same error: the weights are not loaded properly.
simclr log file - log_simclr.txt

Both logs point to the same key mismatch:
Unexpected key(s) in state_dict: "_feature_blocks.conv1.weight",...... versus Missing key(s) in state_dict: "base_model._feature_blocks.conv1.weight",
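
The mismatch reported above can be checked by comparing the two key sets directly. The following is a minimal sketch, not VISSL's own loading logic: `diagnose_prefix` is a hypothetical helper, and the key lists are small samples in the style of the error message rather than the full logs.

```python
from typing import Iterable, Optional


def diagnose_prefix(checkpoint_keys: Iterable[str], model_keys: Iterable[str]) -> Optional[str]:
    """Find a single prefix that maps every checkpoint key onto a model key.

    Returns "" if the checkpoint keys already match the model keys, the
    prefix string if prepending it fixes every key, and None otherwise.
    """
    ckpt = set(checkpoint_keys)
    model = set(model_keys)
    if ckpt <= model:
        return ""
    # Collect every prefix suggested by a (model key, checkpoint key) pair.
    candidates = set()
    for m in model:
        for c in ckpt:
            if m != c and m.endswith(c):
                candidates.add(m[: len(m) - len(c)])
    # Keep the shortest prefix that works for all checkpoint keys.
    for prefix in sorted(candidates, key=len):
        if all(prefix + c in model for c in ckpt):
            return prefix
    return None


# Sample keys (illustrative, not the complete lists from the logs).
unexpected = ["_feature_blocks.conv1.weight", "_feature_blocks.bn1.weight"]
missing = [
    "base_model._feature_blocks.conv1.weight",
    "base_model._feature_blocks.bn1.weight",
]
print(diagnose_prefix(unexpected, missing))  # base_model.
```

If a single prefix explains every key, the problem is probably one level of nesting in the checkpoint rather than incompatible weights; a result of `None` would suggest the architectures themselves differ.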

The YAML config file, in txt format:
linear_classifier_k7_g128.txt

I have tried several append prefixes, but none of them worked. Besides, I believe an append prefix should not be needed for a VISSL pre-trained model.

Additional info for dcv2

Initially, the first error with dcv2 was:

RuntimeError: Error(s) in loading state_dict for CrossEntropyMultipleOutputSingleTargetLoss: Unexpected key(s) in state_dict: "local_memory_embeddings", "local_memory_index"

These are the memory buffer variables used by the model during pre-training, so they get saved under the loss key of the checkpoint.

Since the loss function changes for linear classification training, I deleted the loss keys from the dcv2 pre-trained checkpoint, and that resolved this particular error.
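
The workaround described above could look roughly like this. It is a sketch under assumptions: `drop_loss_state` is a hypothetical helper, the `classy_state_dict` and `loss` key names follow the checkpoint structure reported in this thread, and which exact keys were deleted in the original workaround is not stated.

```python
from typing import Any, Dict


def drop_loss_state(checkpoint: Dict[str, Any], state_key: str = "classy_state_dict") -> Dict[str, Any]:
    """Return a copy of the checkpoint without the saved loss state.

    The input dictionary is left untouched; only the top level and the
    nested state dictionary are copied (tensors are shared, not cloned).
    """
    cleaned = dict(checkpoint)
    if state_key not in cleaned:
        return cleaned
    state = dict(cleaned[state_key])
    state.pop("loss", None)  # removes e.g. local_memory_embeddings / local_memory_index
    cleaned[state_key] = state
    return cleaned


# Toy checkpoint standing in for the real file (values are placeholders).
toy = {
    "phase_idx": 0,
    "classy_state_dict": {"base_model": {}, "loss": {"local_memory_index": [0]}},
}
cleaned = drop_loss_state(toy)
print(sorted(cleaned["classy_state_dict"]))  # ['base_model']

# Assumed usage with PyTorch (file names are placeholders):
#   import torch
#   ckpt = torch.load("dcv2_checkpoint.torch", map_location="cpu")
#   torch.save(drop_loss_state(ckpt), "dcv2_checkpoint_no_loss.torch")
```

Writing the cleaned checkpoint to a new file keeps the original pre-training checkpoint intact in case the loss state is needed again.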

Kindly help me load the weights for the linear classification task.

regards,
DC

environment


sys.platform linux
Python 3.9.6 (default, Nov 16 2021, 12:28:36) [GCC 11.2.0]
numpy 1.21.3
Pillow 9.0.1
vissl 0.1.6 @/p/project/deepacf/kiste/DC/vissl_hdfml2/vissl
GPU available True
GPU 0,1,2,3 Tesla V100-SXM2-32GB
CUDA_HOME /p/software/hdfml/stages/2022/software/CUDA/11.5
torchvision 0.12.0 @/p/software/hdfml/stages/2022/software/torchvision/0.12.0-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/torchvision
hydra 1.0.7 @/p/project/deepacf/kiste/DC/venv_hdfml2/venv/lib/python3.9/site-packages/hydra
classy_vision 0.7.0.dev @/p/project/deepacf/kiste/DC/venv_hdfml2/venv/lib/python3.9/site-packages/classy_vision
tensorboard 2.10.0
apex 0.1 @/p/project/deepacf/kiste/DC/venv_hdfml2/venv/lib/python3.9/site-packages/apex-0.1-py3.9.egg/apex
cv2 4.6.0
PyTorch 1.11 @/p/software/hdfml/stages/2022/software/PyTorch/1.11-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/torch
PyTorch debug build False


PyTorch built with:

  • GCC 11.2
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.5.2 (Git Hash N/A)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 11.5
  • NVCC architecture flags: -gencode;arch=compute_70,code=sm_70
  • CuDNN 8.3.1
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.5, CUDNN_VERSION=8.3.1, CXX_COMPILER=/p/software/hdfml/stages/2022/software/GCCcore/11.2.0/bin/g++, CXX_FLAGS=-O2 -ftree-vectorize -march=haswell -mtune=haswell -fno-math-errno -fopenmp -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.1.11, USE_CUDA=1, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

CPU info:


Architecture x86_64
CPU op-mode(s) 32-bit, 64-bit
Address sizes 46 bits physical, 48 bits virtual
Byte Order Little Endian
CPU(s) 48
On-line CPU(s) list 0-47
Vendor ID GenuineIntel
Model name Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
CPU family 6
Model 85
Thread(s) per core 2
Core(s) per socket 12
Socket(s) 2
Stepping 4
CPU max MHz 3700,0000
CPU min MHz 1000,0000
BogoMIPS 5200.00
Virtualization VT-x
L1d cache 768 KiB (24 instances)
L1i cache 768 KiB (24 instances)
L2 cache 24 MiB (24 instances)
L3 cache 38,5 MiB (2 instances)
NUMA node(s) 2
NUMA node0 CPU(s) 0-11,24-35
NUMA node1 CPU(s) 12-23,36-47
Vulnerability Itlb multihit KVM
Vulnerability L1tf Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown Mitigation; PTI
Vulnerability Spec store bypass Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1 Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2 Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds Not affected
Vulnerability Tsx async abort Mitigation; Clear CPU buffers; SMT vulnerable

@QuentinDuval
Contributor

Hi @DC95,

First of all, sorry for my slow response time, I had a lot on my plate.
Now it's better :) I will have a look at it!

Could you provide me with:

  • the command line you are using?
  • the dictionary structure of the checkpoint you are trying to evaluate?

For the second point, what I mean by dictionary structure is something like this:

{
  "mode": {
    "state_dict": [name of the weights]
  }
}
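
One way to produce such a summary is to walk the loaded checkpoint and replace every leaf value with its type name. This is only a sketch: `describe_structure` is a hypothetical helper, and the toy dictionary below is a stand-in for a real checkpoint.

```python
from collections.abc import Mapping
from pprint import pprint
from typing import Any


def describe_structure(obj: Any, max_depth: int = 3, max_keys: int = 5) -> Any:
    """Summarize nested dictionaries: keep the keys, replace leaves by type names."""
    if not isinstance(obj, Mapping) or max_depth == 0:
        return type(obj).__name__
    items = list(obj.items())
    summary = {
        str(key): describe_structure(value, max_depth - 1, max_keys)
        for key, value in items[:max_keys]
    }
    if len(items) > max_keys:
        # Note how many keys were left out so long weight lists stay readable.
        summary["..."] = f"{len(items) - max_keys} more keys"
    return summary


# Toy stand-in for a checkpoint (values are placeholders).
toy = {
    "phase_idx": 3,
    "classy_state_dict": {"base_model": {"_feature_blocks.conv1.weight": [0.0]}},
}
pprint(describe_structure(toy))

# Assumed usage with PyTorch (file name is a placeholder):
#   import torch
#   pprint(describe_structure(torch.load("checkpoint.torch", map_location="cpu")))
```

Limiting the depth and the number of keys per level keeps the output short enough to paste into an issue.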

Or if you have a public checkpoint that I can use, I can try that as well!

Thank you,
Quentin

@DC95
Author

DC95 commented Sep 22, 2022

Hi @QuentinDuval

Thanks for your response :)

The link to the dcv2 checkpoint is: https://gigamove.rwth-aachen.de/en/download/859065e3c72eabd19c45578abfd17ba0

The link to the SimCLR checkpoint is: https://gigamove.rwth-aachen.de/en/download/cb9771810632728d4f5885f72ac4e551

The top-level keys of the checkpoint are:

phase_idx, iteration, loss, iteration_num, train_phase_idx, classy_state_dict, type

The classy_state_dict looks like this:
dict_keys(['train', 'base_model', 'meters', 'optimizer', 'phase_idx', 'train_phase_idx', 'num_updates', 'losses', 'hooks', 'loss', 'train_dataset_iterator', 'amp'])

The trunk keys look like this: '_feature_blocks.conv1.weight', '_feature_blocks.bn1.weight'

The command line: python run_distributed_engines.py config=pretrain/linear_classifier/linear_classifier_k7_g128

Thanks for taking the time,
DC

@DC95
Author

DC95 commented Sep 27, 2022

Hi @QuentinDuval

I was wondering whether the information above is what you needed, or whether I am doing something wrong.
