Training problem #721

Closed
yonadance opened this issue Apr 6, 2024 · 5 comments

@yonadance

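Training problem:

1. I ran into a problem while training on the Visdrone dataset. After running
python tools/train_net.py --config-file ./configs/Visdrone/sbs_R50-ibn.yml MODEL.DEVICE "cuda:0"

no error is raised, but training never reaches the iteration loop.
2. Because I am on Windows, I did not run the make all step.
3. The full log is below: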

Command Line Args: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://127.0.0.1:49153', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
[04/06 13:08:42 fastreid]: Rank of current process: 0. World size: 1
[04/06 13:08:43 fastreid]: Environment info:
----------------------  ------------------------------------------------------------------------------------
sys.platform            win32
Python                  3.7.16 (default, Jan 17 2023, 16:06:28) [MSC v.1916 64 bit (AMD64)]
numpy                   1.21.6
fastreid                1.3 @.\fastreid
FASTREID_ENV_MODULE     <not set>
PyTorch                 1.13.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torch
PyTorch debug build     False
GPU available           True
GPU 0                   NVIDIA GeForce RTX 3080
CUDA_HOME               C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7
Pillow                  9.5.0
torchvision             0.14.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torchvision
torchvision arch flags  D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\_C.pyd; cannot find cuobjdump
cv2                     4.9.0
----------------------  ------------------------------------------------------------------------------------
PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.5
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

[04/06 13:08:43 fastreid]: Command line arguments: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://127.0.0.1:49153', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
[04/06 13:08:43 fastreid]: Contents of args.config_file=./configs/Visdrone/sbs_R50-ibn.yml:
b'# _*_ coding:utf-8 _*_\r\n_BASE_: ../Base-SBS.yml\r\n\r\n# \xe8\xae\xbe\xe7\xbd\xae\xe7\x9b\xb8\xe5\xba\x94\xe7\x9a\x84\xe6\x95\xb0\xe6\x8d\xae\xe5\xa2\x9e\xe5\xbc\xba\r\nINPUT:\r\n  SIZE_TRAIN: [256, 256]\r\n  SIZE_TEST: [256, 256]\r\n\r\nMODEL:\r\n  BACKBONE:\r\n    WITH_IBN: True\r\n    WITH_NL: True #\xe6\xa8\xa1\xe5\x9e\x8b\xe6\x98\xaf\xe5\x90\xa6\xe4\xbd\xbf\xe7\x94\xa8No_local module\r\n    PRETRAIN: True\r\n    PRETRAIN_PATH: \'pretrained\\veri_sbs_R50-ibn.pth\'\r\n  HEADS:\r\n    POOL_LAYER: GeneralizedMeanPooling # HEAD POOL_LAYERS\r\n  LOSSES:\r\n    NAME: ("CrossEntropyLoss", "TripletLoss",)\r\n    CE:\r\n      EPSILON: 0.1\r\n      SCALE: 1.0\r\n\r\n    TRI:\r\n      MARGIN: 0.0  # \xe8\x80\x83\xe8\x99\x91\xe8\xa6\x81\xe4\xb8\x8d\xe8\xa6\x81\xe8\xbf\x9b\xe8\xa1\x8c\xe5\xaf\xb9\xe5\xba\x94\xe7\x9a\x84\xe8\xb6\x85\xe5\x8f\x82\xe6\x95\xb0\xe7\x9a\x84\xe8\xb0\x83\xe6\x95\xb4\r\n      HARD_MINING: True\r\n      NORM_FEAT: False\r\n      SCALE: 1.0\r\nSOLVER:\r\n  OPT: SGD\r\n  BASE_LR: 0.0001# 0.01\r\n  ETA_MIN_LR: 7.7e-5\r\n\r\n  IMS_PER_BATCH: 128 # batchsize\r\n  MAX_EPOCH: 10 # 60\r\n  WARMUP_ITERS: 3000\r\n  FREEZE_ITERS: 3000\r\n\r\n  CHECKPOINT_PERIOD: 10\r\n\r\nDATASETS:\r\n  NAMES: ("Visdrone",)\r\n  TESTS: ("Visdrone",)\r\n\r\nDATALOADER:\r\n  SAMPLER_TRAIN: BalancedIdentitySampler\r\n  NUM_INSTANCE: 4\r\n  NUM_WORKERS: 8\r\nTEST:\r\n  EVAL_PERIOD: 10\r\n  IMS_PER_BATCH: 256 # 256\r\n\r\nOUTPUT_DIR: logs/visdrone/sbs_R50-ibn'
[04/06 13:08:43 fastreid]: Running with full config:
CUDNN_BENCHMARK: False
DATALOADER:
  NUM_INSTANCE: 4
  NUM_WORKERS: 8
  SAMPLER_TRAIN: BalancedIdentitySampler
  SET_WEIGHT: []
DATASETS:
  COMBINEALL: False
  NAMES: ('Visdrone',)
  TESTS: ('Visdrone',)
INPUT:
  AFFINE:
    ENABLED: False
  AUGMIX:
    ENABLED: False
    PROB: 0.0
  AUTOAUG:
    ENABLED: True
    PROB: 0.1
  CJ:
    BRIGHTNESS: 0.15
    CONTRAST: 0.15
    ENABLED: False
    HUE: 0.1
    PROB: 0.5
    SATURATION: 0.1
  CROP:
    ENABLED: False
    RATIO: [0.75, 1.3333333333333333]
    SCALE: [0.16, 1]
    SIZE: [224, 224]
  FLIP:
    ENABLED: True
    PROB: 0.5
  PADDING:
    ENABLED: True
    MODE: constant
    SIZE: 10
  REA:
    ENABLED: True
    PROB: 0.5
    VALUE: [123.675, 116.28, 103.53]
  RPT:
    ENABLED: False
    PROB: 0.5
  SIZE_TEST: [256, 256]
  SIZE_TRAIN: [256, 256]
KD:
  EMA:
    ENABLED: False
    MOMENTUM: 0.999
  MODEL_CONFIG: []
  MODEL_WEIGHTS: []
MODEL:
  BACKBONE:
    ATT_DROP_RATE: 0.0
    DEPTH: 50x
    DROP_PATH_RATIO: 0.1
    DROP_RATIO: 0.0
    FEAT_DIM: 2048
    LAST_STRIDE: 1
    NAME: build_resnet_backbone
    NORM: BN
    PRETRAIN: True
    PRETRAIN_PATH: pretrained\veri_sbs_R50-ibn.pth
    SIE_COE: 3.0
    STRIDE_SIZE: (16, 16)
    WITH_IBN: True
    WITH_NL: True
    WITH_SE: False
  DEVICE: cuda:0
  FREEZE_LAYERS: ['backbone']
  HEADS:
    CLS_LAYER: CircleSoftmax
    EMBEDDING_DIM: 0
    MARGIN: 0.35
    NAME: EmbeddingHead
    NECK_FEAT: after
    NORM: BN
    NUM_CLASSES: 0
    POOL_LAYER: GeneralizedMeanPooling
    SCALE: 64
    WITH_BNNECK: True
  LOSSES:
    CE:
      ALPHA: 0.2
      EPSILON: 0.1
      SCALE: 1.0
    CIRCLE:
      GAMMA: 128
      MARGIN: 0.25
      SCALE: 1.0
    COSFACE:
      GAMMA: 128
      MARGIN: 0.25
      SCALE: 1.0
    FL:
      ALPHA: 0.25
      GAMMA: 2
      SCALE: 1.0
    NAME: ('CrossEntropyLoss', 'TripletLoss')
    TRI:
      HARD_MINING: True
      MARGIN: 0.0
      NORM_FEAT: False
      SCALE: 1.0
  META_ARCHITECTURE: Baseline
  PIXEL_MEAN: [123.675, 116.28, 103.53]
  PIXEL_STD: [58.395, 57.120000000000005, 57.375]
  QUEUE_SIZE: 8192
  WEIGHTS:
OUTPUT_DIR: logs/visdrone/sbs_R50-ibn
SOLVER:
  AMP:
    ENABLED: True
  BASE_LR: 0.0001
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 10
  CLIP_GRADIENTS:
    CLIP_TYPE: norm
    CLIP_VALUE: 5.0
    ENABLED: False
    NORM_TYPE: 2.0
  DELAY_EPOCHS: 30
  ETA_MIN_LR: 7.7e-05
  FREEZE_ITERS: 3000
  GAMMA: 0.1
  HEADS_LR_FACTOR: 1.0
  IMS_PER_BATCH: 128
  MAX_EPOCH: 10
  MOMENTUM: 0.9
  NESTEROV: False
  OPT: SGD
  SCHED: CosineAnnealingLR
  STEPS: [40, 90]
  WARMUP_FACTOR: 0.1
  WARMUP_ITERS: 3000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0005
  WEIGHT_DECAY_BIAS: 0.0005
  WEIGHT_DECAY_NORM: 0.0005
TEST:
  AQE:
    ALPHA: 3.0
    ENABLED: False
    QE_K: 5
    QE_TIME: 1
  EVAL_PERIOD: 10
  FLIP:
    ENABLED: False
  IMS_PER_BATCH: 256
  METRIC: cosine
  PRECISE_BN:
    DATASET: Market1501
    ENABLED: False
    NUM_ITER: 300
  RERANK:
    ENABLED: False
    K1: 20
    K2: 6
    LAMBDA: 0.3
  ROC:
    ENABLED: False
[04/06 13:08:43 fastreid]: Full config saved to D:\zhuangshilin\BoT_SORT\fast_reid\logs\visdrone\sbs_R50-ibn\config.yaml
D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\transforms\transforms.py:330: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
  "Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. "

Expected behavior:

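After this point the program hangs and the log stops updating. GPU utilization is only about 10%, so training has not actually started. I added a print in my own dataset.py; its output appears right after the lines above, and then nothing further happens. How can I find out exactly where the program is stuck?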
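One generic way to locate a silent hang like this is Python's standard `faulthandler` module, which can print the stack of every thread if the process is still running after a timeout. The sketch below is not part of fastreid; the helper names `enable_hang_dump` and `disable_hang_dump` are made up for illustration, and where to call them (for example near the top of tools/train_net.py) is an assumption.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=60.0, repeat=True, file=None):
    """Ask faulthandler to dump all thread stacks if we are still running after timeout_s.

    `file` must be a real file object with a file descriptor; defaults to sys.stderr.
    With repeat=True the dump is printed again every timeout_s seconds.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=target)


def disable_hang_dump():
    """Cancel the pending dump, e.g. once the first training iteration has run."""
    faulthandler.cancel_dump_traceback_later()
```

If the process stalls, the dump shows the file and line each thread is sitting on, which narrows the search without needing a debugger attached.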

@yonadance

After setting breakpoints and debugging, I found that execution hangs at this call in class AMPTrainer in fastreid.engine.train_loop:
super().__init__(model, data_loader, optimizer, param_wrapper)
It never gets past this line.
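The log does not show what that base constructor does. In detectron2-style trainers it builds the data loader iterator with `iter(data_loader)`, and I am assuming (not confirmed here) that fastreid does the same. Under that assumption, the hang would be in the sampler or in worker start-up rather than in the model. A minimal sketch to test this outside the trainer: time `iter()` and the first `next()` on the loader in a background thread, with a timeout. The helper name `probe_first_batch` is hypothetical, and it works on any iterable.

```python
import threading
import time


def probe_first_batch(loader, timeout_s=120.0):
    """Time iter(loader) and the first next() in a daemon thread.

    Returns a dict with 'iter_s' and 'first_batch_s' in seconds.
    A value of None means that step did not finish within timeout_s.
    """
    timings = {"iter_s": None, "first_batch_s": None}

    def worker():
        t0 = time.perf_counter()
        it = iter(loader)
        timings["iter_s"] = time.perf_counter() - t0
        t1 = time.perf_counter()
        next(it, None)  # None default avoids StopIteration on an empty loader
        timings["first_batch_s"] = time.perf_counter() - t1

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # give up waiting after timeout_s; the thread may still run
    return dict(timings)  # snapshot taken at the moment we stop waiting
```

Running it once with NUM_WORKERS set to 0 and once with the original value would help separate a sampler that never yields from slow worker-process start-up on Windows.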

@yonadance

After changing IMS_PER_BATCH, training runs, but the loss is still 0 after many iterations.
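The thread does not say why changing IMS_PER_BATCH helped. One possible explanation, offered as an assumption: identity-based samplers draw NUM_INSTANCE images per identity, so a batch needs IMS_PER_BATCH / NUM_INSTANCE distinct identities, which is 128 / 4 = 32 with the config in the log. If the training set has fewer identities than that, the sampler may never be able to fill a batch. A rough sanity check, where `check_identity_batch` is a hypothetical helper and `pids` is whatever list of identity labels your dataset class produces:

```python
from collections import Counter


def check_identity_batch(pids, ims_per_batch, num_instance):
    """Compare the number of distinct identities with what one batch would need."""
    counts = Counter(pids)
    # Floor division: identities needed per batch if each contributes num_instance images.
    ids_per_batch = ims_per_batch // num_instance
    return {
        "num_ids": len(counts),
        "ids_per_batch": ids_per_batch,
        "enough_ids": len(counts) >= ids_per_batch,
        "min_images_per_id": min(counts.values()) if counts else 0,
    }


# Values from the config in the log: IMS_PER_BATCH=128, NUM_INSTANCE=4.
report = check_identity_batch([1] * 500, ims_per_batch=128, num_instance=4)
print(report)
```

With a single identity the check reports 1 available identity against 32 needed, which would be consistent with a smaller batch size making the hang go away.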

@yonadance

Question: what problems would it cause if the ID in the dataset is 1?
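The question never got an answer in the thread. Reading it as "the training set contains only one identity" (an assumption), a loss of exactly 0 is what the math predicts for the classification term: softmax over a single class always outputs probability 1, so the cross-entropy is log 1 = 0 whatever the logit is. The plain-Python sketch below uses the standard label-smoothing formula with EPSILON = 0.1 from the config; whether fastreid's CrossEntropyLoss uses exactly this form is not verified here.

```python
import math


def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Label-smoothed cross-entropy for one sample, written out by hand."""
    num_classes = len(logits)
    m = max(logits)
    # Numerically stable log-sum-exp.
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    loss = 0.0
    for k, x in enumerate(logits):
        # Smoothed target: epsilon spread over all classes, the rest on the true class.
        q = epsilon / num_classes + ((1.0 - epsilon) if k == target else 0.0)
        loss -= q * (x - log_z)
    return loss


one_class = smoothed_cross_entropy([3.7], target=0)
two_classes = smoothed_cross_entropy([0.0, 0.0], target=0)
print(one_class, two_classes)
```

A triplet loss has a similar problem with one identity, since every other sample in the batch is a positive and there is no negative to push away. So the classifier and the metric term both need at least two identities before the loss can carry any signal.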

github-actions bot commented May 7, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label May 7, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
