
Inconsistent results when running inference on trained ResNet model #562

Open
BartvanMarrewijk opened this issue Aug 12, 2022 · 1 comment


@BartvanMarrewijk

Instructions To Reproduce the Issue:

I have trained an ImageNet-pretrained ResNet-50 on a custom dataset with 12 classes.
For training I used the following yaml file:
training_yaml_file
In this yaml file I only changed the PARAMS of the head layers:

  HEAD:
    PARAMS: [
      ["eval_mlp", {"in_channels": 64, "dims": [9216, 12]}],
      ["eval_mlp", {"in_channels": 256, "dims": [9216, 12]}],
      ["eval_mlp", {"in_channels": 512, "dims": [8192, 12]}],
      ["eval_mlp", {"in_channels": 1024, "dims": [9216, 12]}],
      ["eval_mlp", {"in_channels": 2048, "dims": [8192, 12]}],
    ]
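
For reference, these dims[0] values are consistent with each eval_mlp flattening an adaptively pooled feature map from its trunk stage; a quick arithmetic check (the pool sizes below are inferred from the numbers above, not read from the VISSL source):

# Sanity check: each head's input dim should equal in_channels * pooled_h * pooled_w.
# The pool sizes here are inferred from the dims above, not taken from VISSL itself.
heads = [
    (64,   (12, 12)),  # 64 * 144 = 9216
    (256,  (6, 6)),    # 256 * 36 = 9216
    (512,  (4, 4)),    # 512 * 16 = 8192
    (1024, (3, 3)),    # 1024 * 9 = 9216
    (2048, (2, 2)),    # 2048 * 4 = 8192
]
for in_channels, (h, w) in heads:
    print(f"{in_channels} channels -> flattened dim {in_channels * h * w}")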

Then I trained the model with the following command:

!python3 /home/ubuntu2004/vissl/tools/run_distributed_engines.py \
    hydra.verbose=true \
    config=custom_configs/eval_resnet_8gpu_transfer_in1k_linear.yaml \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX=trunk.base_model._feature_blocks. \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME='' \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/content/model_zoo/resnet50-19c8e357_supervised_Imagenet1k.pth" \
    config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch \
    config.MODEL.SYNC_BN_CONFIG.GROUP_SIZE=-1 \
    config.DATA.TRAIN.DATASET_NAMES=[custom_imagenet] \
    config.DATA.TEST.DATASET_NAMES=[custom_imagenet] \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=4 \
    config.DISTRIBUTED.RUN_ID=auto \
    config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true \
    config.CHECKPOINT.DIR="/content/checkpoints/linear_superresnet12class"
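
Before debugging inference, it can help to confirm what the training run actually saved. A minimal sketch, assuming the consolidated checkpoint is a plain pickled dict (the path is the one used in the inference code below):

import torch

# Load the consolidated checkpoint on CPU and list its top-level keys;
# "classy_state_dict" is the key the inference code below expects to find.
ckpt = torch.load(
    "/content/checkpoints/linear_superresnet12class/model_phase26.torch",
    map_location="cpu",
)
print(list(ckpt.keys()))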

During training, the top-1 accuracy of the 5 different heads ranges between 80% and 90%.
For inference I used the following code, slightly adapted from the second part of the inference tutorial:

from omegaconf import OmegaConf
from vissl.utils.hydra_config import AttrDict, compose_hydra_configuration, convert_to_attrdict

# The base config is custom_configs/eval_resnet_8gpu_transfer_in1k_linear_bart.yaml.
# All other options below override that yaml config.

cfg = [
  'config=custom_configs/eval_resnet_8gpu_transfer_in1k_linear_bart.yaml',
  'config.MODEL.WEIGHTS_INIT.PARAMS_FILE=/content/checkpoints/linear_superresnet12class/model_phase26.torch',
  'config.MODEL.FEATURE_EVAL_SETTINGS.EVAL_MODE_ON=True', # Turn on model evaluation mode.
  'config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_AND_HEAD=True', # Freeze both the trunk and the head.
  'config.MODEL.FEATURE_EVAL_SETTINGS.EVAL_TRUNK_AND_HEAD=True', # Evaluate the full model (trunk + head), not just trunk features.
]

# Compose the hydra configuration.
cfg = compose_hydra_configuration(cfg)
# Convert to AttrDict. This method will also infer certain config options
# and validate the config is valid.
_, cfg = convert_to_attrdict(cfg)

# Build the model
from vissl.models import build_model
from vissl.utils.checkpoint import init_model_from_consolidated_weights

from classy_vision.generic.util import load_checkpoint

model = build_model(cfg.MODEL, cfg.OPTIMIZER)

# Load the checkpoint weights.
weights = load_checkpoint(checkpoint_path=cfg.MODEL.WEIGHTS_INIT.PARAMS_FILE)

# Initialize the model with the fine-tuned checkpoint weights.
init_model_from_consolidated_weights(
    config=cfg,
    model=model,
    state_dict=weights,
    state_dict_key_name="classy_state_dict",
    skip_layers=[],  # Use this if you do not want to load all layers
)

print("Weights have loaded")
# model.heads[0].clf.clf[0].weight
# Spot-check a slice of the first conv kernel:
model.trunk.base_model._feature_blocks.conv1.weight[0][0][0]
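
To tell whether the checkpoint weights actually reached the model, one tensor can also be compared directly against the raw checkpoint. A diagnostic sketch; the nesting of the consolidated checkpoint dict below is an assumption and may need adjusting:

import torch

# Pull one conv weight out of the raw checkpoint and compare it to the
# corresponding tensor in the built model. False means the checkpoint
# weights never reached the model and the layer is still randomly initialized.
raw = torch.load(cfg.MODEL.WEIGHTS_INIT.PARAMS_FILE, map_location="cpu")
trunk_state = raw["classy_state_dict"]["base_model"]["model"]["trunk"]  # assumed nesting
ckpt_conv1 = trunk_state["_feature_blocks.conv1.weight"]
model_conv1 = model.trunk.base_model._feature_blocks.conv1.weight
print(torch.allclose(ckpt_conv1, model_conv1.detach().cpu()))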

Problem

Every time I load the model, the weights are different:

model.trunk.base_model._feature_blocks.conv1.weight[0][0][0]
tensor([ 0.0585, -0.0269, -0.0202,  0.0078, -0.0273,  0.0771,  0.0107])

Reloading again:

model.trunk.base_model._feature_blocks.conv1.weight[0][0][0]
tensor([-0.0181,  0.0242, -0.0541, -0.0252, -0.0747,  0.0054, -0.0472])

In addition, the printed model reports a ResNeXt trunk even though I trained a ResNet. As a result of the changing weights, my output is essentially random. Does anybody have a solution?
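
Side note: as far as I can tell, VISSL implements its ResNet trunk in a class named ResNeXt (one class covers both variants), so seeing ResNeXt in the printed model does not by itself mean the wrong architecture was built. A quick check, using only attributes already accessed above:

# Print the class name of the trunk module that was actually built:
print(type(model.trunk.base_model).__name__)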

Environment:


sys.platform         linux
Python               3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
numpy                1.19.5
Pillow               9.2.0
vissl                0.1.6 @/home/ubuntu2004/vissl/vissl
GPU available        True
GPU 0,1,2,3          NVIDIA GeForce GTX TITAN X
CUDA_HOME            /usr/local/cuda-11.1
torchvision          0.9.1 @/home/ubuntu2004/anaconda3/envs/swav/lib/python3.8/site-packages/torchvision
hydra                1.0.7 @/home/ubuntu2004/anaconda3/envs/swav/lib/python3.8/site-packages/hydra
classy_vision        0.7.0.dev @/home/ubuntu2004/anaconda3/envs/swav/lib/python3.8/site-packages/classy_vision
tensorboard          2.9.1
apex                 0.1 @/home/ubuntu2004/anaconda3/envs/swav/lib/python3.8/site-packages/apex
cv2                  4.6.0
PyTorch              1.8.1 @/home/ubuntu2004/anaconda3/envs/swav/lib/python3.8/site-packages/torch
PyTorch debug build  False
-------------------  -----------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

CPU info:
-------------------------------  ---------------------------------------------------------------------------------
Architecture                     x86_64
CPU op-mode(s)                   32-bit, 64-bit
Byte Order                       Little Endian
Address sizes                    46 bits physical, 48 bits virtual
CPU(s)                           12
On-line CPU(s) list              0-11
Thread(s) per core               2
Core(s) per socket               6
Socket(s)                        1
NUMA node(s)                     1
Vendor ID                        GenuineIntel
CPU family                       6
Model                            63
Model name                       Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Stepping                         2
CPU MHz                          1600.000
CPU max MHz                      4100.0000
CPU min MHz                      1200.0000
BogoMIPS                         7000.16
L1d cache                        192 KiB
L1i cache                        192 KiB
L2 cache                         1.5 MiB
L3 cache                         15 MiB
NUMA node0 CPU(s)                0-11
Vulnerability Itlb multihit      KVM
Vulnerability L1tf               Mitigation; PTE Inversion
Vulnerability Mds                Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown           Mitigation; PTI
Vulnerability Mmio stale data    Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Spec store bypass  Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1         Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2         Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds              Not affected
Vulnerability Tsx async abort    Not affected

QuentinDuval self-assigned this Jan 6, 2023
@QuentinDuval
Contributor

Hello @studentWUR,

First of all, thank you for using VISSL and for raising your question :)
(and sorry for the delay in my answer...)

It would seem from your description that the model is not being loaded correctly, so the weights stay random, hence the random accuracy at the end. Could you check the logs and grep for the line "Extra layers not loaded from checkpoint"? It will indicate whether any weights failed to load.

Thank you,
Quentin
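
The same check can also be done in Python instead of grepping logs; a sketch reusing cfg and model from the snippet above (the nesting of the consolidated checkpoint dict is an assumption):

import torch

# List model parameters that have no counterpart in the checkpoint;
# anything printed here was left at its random initialization.
ckpt = torch.load(cfg.MODEL.WEIGHTS_INIT.PARAMS_FILE, map_location="cpu")
trunk_state = ckpt["classy_state_dict"]["base_model"]["model"]["trunk"]  # assumed nesting
model_keys = set(model.trunk.base_model.state_dict().keys())
print(sorted(model_keys - set(trunk_state.keys())))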
