Training problem #721

yonadance · 2024-04-06T05:20:03Z

training problem:

I ran into a problem while training on the Visdrone dataset. After running

python tools/train_net.py --config-file ./configs/Visdrone/sbs_R50-ibn.yml MODEL.DEVICE "cuda:0"

no error was reported, but the run also never reached the training iterations.
2. Because I am on Windows, I did not run the make all step.
3. The full log is as follows:

Command Line Args: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://127.0.0.1:49153', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
[04/06 13:08:42 fastreid]: Rank of current process: 0. World size: 1
[04/06 13:08:43 fastreid]: Environment info:
----------------------  ------------------------------------------------------------------------------------
sys.platform            win32
Python                  3.7.16 (default, Jan 17 2023, 16:06:28) [MSC v.1916 64 bit (AMD64)]
numpy                   1.21.6
fastreid                1.3 @.\fastreid
FASTREID_ENV_MODULE     <not set>
PyTorch                 1.13.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torch
PyTorch debug build     False
GPU available           True
GPU 0                   NVIDIA GeForce RTX 3080
CUDA_HOME               C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7
Pillow                  9.5.0
torchvision             0.14.1+cu117 @D:\anaconda\envs\BOTsort\lib\site-packages\torchvision
torchvision arch flags  D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\_C.pyd; cannot find cuobjdump
cv2                     4.9.0
----------------------  ------------------------------------------------------------------------------------
PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.5
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

[04/06 13:08:43 fastreid]: Command line arguments: Namespace(config_file='./configs/Visdrone/sbs_R50-ibn.yml', dist_url='tcp://127.0.0.1:49153', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.DEVICE', 'cuda:0'], resume=False)
[04/06 13:08:43 fastreid]: Contents of args.config_file=./configs/Visdrone/sbs_R50-ibn.yml:
b'# _*_ coding:utf-8 _*_\r\n_BASE_: ../Base-SBS.yml\r\n\r\n# \xe8\xae\xbe\xe7\xbd\xae\xe7\x9b\xb8\xe5\xba\x94\xe7\x9a\x84\xe6\x95\xb0\xe6\x8d\xae\xe5\xa2\x9e\xe5\xbc\xba\r\nINPUT:\r\n  SIZE_TRAIN: [256, 256]\r\n  SIZE_TEST: [256, 256]\r\n\r\nMODEL:\r\n  BACKBONE:\r\n    WITH_IBN: True\r\n    WITH_NL: True #\xe6\xa8\xa1\xe5\x9e\x8b\xe6\x98\xaf\xe5\x90\xa6\xe4\xbd\xbf\xe7\x94\xa8No_local module\r\n    PRETRAIN: True\r\n    PRETRAIN_PATH: \'pretrained\\veri_sbs_R50-ibn.pth\'\r\n  HEADS:\r\n    POOL_LAYER: GeneralizedMeanPooling # HEAD POOL_LAYERS\r\n  LOSSES:\r\n    NAME: ("CrossEntropyLoss", "TripletLoss",)\r\n    CE:\r\n      EPSILON: 0.1\r\n      SCALE: 1.0\r\n\r\n    TRI:\r\n      MARGIN: 0.0  # \xe8\x80\x83\xe8\x99\x91\xe8\xa6\x81\xe4\xb8\x8d\xe8\xa6\x81\xe8\xbf\x9b\xe8\xa1\x8c\xe5\xaf\xb9\xe5\xba\x94\xe7\x9a\x84\xe8\xb6\x85\xe5\x8f\x82\xe6\x95\xb0\xe7\x9a\x84\xe8\xb0\x83\xe6\x95\xb4\r\n      HARD_MINING: True\r\n      NORM_FEAT: False\r\n      SCALE: 1.0\r\nSOLVER:\r\n  OPT: SGD\r\n  BASE_LR: 0.0001# 0.01\r\n  ETA_MIN_LR: 7.7e-5\r\n\r\n  IMS_PER_BATCH: 128 # batchsize\r\n  MAX_EPOCH: 10 # 60\r\n  WARMUP_ITERS: 3000\r\n  FREEZE_ITERS: 3000\r\n\r\n  CHECKPOINT_PERIOD: 10\r\n\r\nDATASETS:\r\n  NAMES: ("Visdrone",)\r\n  TESTS: ("Visdrone",)\r\n\r\nDATALOADER:\r\n  SAMPLER_TRAIN: BalancedIdentitySampler\r\n  NUM_INSTANCE: 4\r\n  NUM_WORKERS: 8\r\nTEST:\r\n  EVAL_PERIOD: 10\r\n  IMS_PER_BATCH: 256 # 256\r\n\r\nOUTPUT_DIR: logs/visdrone/sbs_R50-ibn'
[04/06 13:08:43 fastreid]: Running with full config:
CUDNN_BENCHMARK: False
DATALOADER:
  NUM_INSTANCE: 4
  NUM_WORKERS: 8
  SAMPLER_TRAIN: BalancedIdentitySampler
  SET_WEIGHT: []
DATASETS:
  COMBINEALL: False
  NAMES: ('Visdrone',)
  TESTS: ('Visdrone',)
INPUT:
  AFFINE:
    ENABLED: False
  AUGMIX:
    ENABLED: False
    PROB: 0.0
  AUTOAUG:
    ENABLED: True
    PROB: 0.1
  CJ:
    BRIGHTNESS: 0.15
    CONTRAST: 0.15
    ENABLED: False
    HUE: 0.1
    PROB: 0.5
    SATURATION: 0.1
  CROP:
    ENABLED: False
    RATIO: [0.75, 1.3333333333333333]
    SCALE: [0.16, 1]
    SIZE: [224, 224]
  FLIP:
    ENABLED: True
    PROB: 0.5
  PADDING:
    ENABLED: True
    MODE: constant
    SIZE: 10
  REA:
    ENABLED: True
    PROB: 0.5
    VALUE: [123.675, 116.28, 103.53]
  RPT:
    ENABLED: False
    PROB: 0.5
  SIZE_TEST: [256, 256]
  SIZE_TRAIN: [256, 256]
KD:
  EMA:
    ENABLED: False
    MOMENTUM: 0.999
  MODEL_CONFIG: []
  MODEL_WEIGHTS: []
MODEL:
  BACKBONE:
    ATT_DROP_RATE: 0.0
    DEPTH: 50x
    DROP_PATH_RATIO: 0.1
    DROP_RATIO: 0.0
    FEAT_DIM: 2048
    LAST_STRIDE: 1
    NAME: build_resnet_backbone
    NORM: BN
    PRETRAIN: True
    PRETRAIN_PATH: pretrained\veri_sbs_R50-ibn.pth
    SIE_COE: 3.0
    STRIDE_SIZE: (16, 16)
    WITH_IBN: True
    WITH_NL: True
    WITH_SE: False
  DEVICE: cuda:0
  FREEZE_LAYERS: ['backbone']
  HEADS:
    CLS_LAYER: CircleSoftmax
    EMBEDDING_DIM: 0
    MARGIN: 0.35
    NAME: EmbeddingHead
    NECK_FEAT: after
    NORM: BN
    NUM_CLASSES: 0
    POOL_LAYER: GeneralizedMeanPooling
    SCALE: 64
    WITH_BNNECK: True
  LOSSES:
    CE:
      ALPHA: 0.2
      EPSILON: 0.1
      SCALE: 1.0
    CIRCLE:
      GAMMA: 128
      MARGIN: 0.25
      SCALE: 1.0
    COSFACE:
      GAMMA: 128
      MARGIN: 0.25
      SCALE: 1.0
    FL:
      ALPHA: 0.25
      GAMMA: 2
      SCALE: 1.0
    NAME: ('CrossEntropyLoss', 'TripletLoss')
    TRI:
      HARD_MINING: True
      MARGIN: 0.0
      NORM_FEAT: False
      SCALE: 1.0
  META_ARCHITECTURE: Baseline
  PIXEL_MEAN: [123.675, 116.28, 103.53]
  PIXEL_STD: [58.395, 57.120000000000005, 57.375]
  QUEUE_SIZE: 8192
  WEIGHTS:
OUTPUT_DIR: logs/visdrone/sbs_R50-ibn
SOLVER:
  AMP:
    ENABLED: True
  BASE_LR: 0.0001
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 10
  CLIP_GRADIENTS:
    CLIP_TYPE: norm
    CLIP_VALUE: 5.0
    ENABLED: False
    NORM_TYPE: 2.0
  DELAY_EPOCHS: 30
  ETA_MIN_LR: 7.7e-05
  FREEZE_ITERS: 3000
  GAMMA: 0.1
  HEADS_LR_FACTOR: 1.0
  IMS_PER_BATCH: 128
  MAX_EPOCH: 10
  MOMENTUM: 0.9
  NESTEROV: False
  OPT: SGD
  SCHED: CosineAnnealingLR
  STEPS: [40, 90]
  WARMUP_FACTOR: 0.1
  WARMUP_ITERS: 3000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0005
  WEIGHT_DECAY_BIAS: 0.0005
  WEIGHT_DECAY_NORM: 0.0005
TEST:
  AQE:
    ALPHA: 3.0
    ENABLED: False
    QE_K: 5
    QE_TIME: 1
  EVAL_PERIOD: 10
  FLIP:
    ENABLED: False
  IMS_PER_BATCH: 256
  METRIC: cosine
  PRECISE_BN:
    DATASET: Market1501
    ENABLED: False
    NUM_ITER: 300
  RERANK:
    ENABLED: False
    K1: 20
    K2: 6
    LAMBDA: 0.3
  ROC:
    ENABLED: False
[04/06 13:08:43 fastreid]: Full config saved to D:\zhuangshilin\BoT_SORT\fast_reid\logs\visdrone\sbs_R50-ibn\config.yaml
D:\anaconda\envs\BOTsort\lib\site-packages\torchvision\transforms\transforms.py:330: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
  "Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. "

Expected behavior:

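After this point the program hangs and the log stops updating. GPU utilization is only about 10%, so training has not actually started. I also added a print inside my own dataset.py; its output appears after the lines above, and then nothing further happens. How can I find out where exactly the program is stuck?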
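One general way to locate a hang like this is Python's standard faulthandler module, which can dump the stack of every thread after a timeout. The sketch below is not part of FastReID; the helper names watch_for_hang and stop_watching are hypothetical, and the suggestion to call the helper near the top of tools/train_net.py is an assumption about where it would be most useful.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=60.0, out=sys.stderr):
    """Dump the traceback of every thread to `out` each time `timeout_s`
    seconds pass without stop_watching() being called.

    `out` must be a real file object (it needs a file descriptor).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def stop_watching():
    """Cancel the pending dump, e.g. once the first iteration has completed."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage near the top of the training script:
#   watch_for_hang(120.0)
#   ... build the trainer and start training ...
#   stop_watching()
```

The dump only covers the process that called it, so DataLoader worker processes are not included. If the main process turns out to be waiting on the data loader, rerunning with DATALOADER.NUM_WORKERS 0 (the key appears in the config above) keeps data loading in the main process so that it shows up in the dump.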


yonadance · 2024-04-06T09:36:15Z

After setting breakpoints and debugging, I found that it gets stuck here:
in class AMPTrainer in fastreid.engine.train_loop, at
super().__init__(model, data_loader, optimizer, param_wrapper)
Execution never gets past this call.

yonadance · 2024-04-06T13:19:29Z

After changing IMS_PER_BATCH it runs, but the loss is still 0 after many iterations.

yonadance · 2024-04-06T15:25:36Z

Question: what problems would arise if the dataset's ID is 1?
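One possible reading of this question, and it is only an assumption, is that every training image carries the same identity label, so the classifier has a single class. In that case a softmax cross-entropy loss is zero no matter what the model outputs, which would match a loss that stays at 0. The sketch below uses plain Python and a generic label-smoothing formula; the function softmax_cross_entropy is hypothetical and is not FastReID's CrossEntropyLoss.

```python
import math


def softmax_cross_entropy(logits, target, eps=0.0):
    """Label-smoothed softmax cross-entropy for one sample.

    Generic textbook form (an assumption, not FastReID's implementation):
    the target distribution puts (1 - eps) on the true class and spreads
    eps uniformly over all classes.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    n = len(logits)
    weights = [eps / n + (1.0 - eps) * (i == target) for i in range(n)]
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# One identity: softmax over a single logit is exactly 1, so the loss is zero.
single_class = softmax_cross_entropy([3.7], target=0, eps=0.1)

# Two identities: the loss is strictly positive.
two_classes = softmax_cross_entropy([3.7, 1.2], target=0, eps=0.1)
```

A triplet loss has a similar dependency in general: it compares an anchor with a sample of a different identity, so it also needs at least two identities in the training set to give a useful signal.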

github-actions · 2024-05-07T02:11:24Z

This issue is stale because it has been open for 30 days with no activity.

github-actions · 2024-05-22T02:10:00Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot added the stale label May 7, 2024

github-actions bot closed this as completed May 22, 2024
