
Issue about pretraining [return code = -8], anyone can help me? #1495

Open · Jeremy-lf opened this issue May 9, 2024 · 0 comments

Jeremy-lf commented May 9, 2024

Question

When I run the pretraining stage, I hit the problem below. There is no obvious error message, so how should I solve it?

Environment: 8× A800 GPUs, CUDA 11.6

My training script is:

sh scripts/v1_5/pretrain.sh

#!/bin/bash

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path vicuna-7b-v1.3 \
    --version plain \
    --data_path /root/paddlejob/workspace/env_run/xx/llava/LLaVA/data/llava-v1.5-7b/blip_laion_cc_sbu_558k.json \
    --image_folder /root/paddlejob/workspace/env_run/xx/llava/LLaVA/data/llava-v1.5-7b/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-13b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

The error output is:

[2024-05-09 15:33:07,293] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:09,662] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-09 15:33:09,662] [INFO] [runner.py:571:main] cmd = /root/paddlejob/workspace/env_run/anaconda3/envs/llava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed ./scripts/zero2.json --model_name_or_path vicuna-7b-v1.3 --version plain --data_path /root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/blip_laion_cc_sbu_558k.json --image_folder /root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/images --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir ./checkpoints/llava-v1.5-13b-pretrain --num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 24000 --save_total_limit 1 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True

[2024-05-09 15:33:10,897] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_GID_INDEX=3
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_CONNECT_RETRY_CNT=15
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=22
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.7.8
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_CUDA_SUPPORT=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_QPS_PER_CONNECTION=8
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=INFO
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=xgbe0
[2024-05-09 15:33:13,106] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-05-09 15:33:13,106] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-05-09 15:33:13,106] [INFO] [launch.py:163:main] dist_world_size=8
[2024-05-09 15:33:13,106] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-05-09 15:33:17,512] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,649] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,692] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,761] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,771] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,835] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,852] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,948] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,948] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-05-09 15:33:17,956] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,992] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:18,037] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:18,046] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:18,048] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:18,178] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:18,234] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:18,239] [INFO] [comm.py:637:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.57s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Formatting inputs...Skip in lazy mode
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
wandb: Tracking run with wandb version 0.16.4
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
[2024-05-09 15:34:25,193] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83400
[2024-05-09 15:34:25,584] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83401
[2024-05-09 15:34:25,587] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83402
[2024-05-09 15:34:25,590] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83403
[2024-05-09 15:34:26,659] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83404
[2024-05-09 15:34:26,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83405
[2024-05-09 15:34:26,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83406
[2024-05-09 15:34:26,664] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83407
[2024-05-09 15:34:26,666] [ERROR] [launch.py:321:sigkill_handler] ['/root/paddlejob/workspace/env_run/anaconda3/envs/llava/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=7', '--deepspeed', './scripts/zero2.json', '--model_name_or_path', 'vicuna-7b-v1.3', '--version', 'plain', '--data_path', '/root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/blip_laion_cc_sbu_558k.json', '--image_folder', '/root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/images', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--tune_mm_mlp_adapter', 'True', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--bf16', 'True', '--output_dir', './checkpoints/llava-v1.5-13b-pretrain', '--num_train_epochs', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '24000', '--save_total_limit', '1', '--learning_rate', '1e-3', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True'] exits with return code = -8
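
For reference, a negative return code from the launcher means the worker process was killed by that signal number, so "return code = -8" corresponds to signal 8 (SIGFPE on Linux). Below is a minimal sketch to decode the code, plus a hypothetical way to keep per-rank logs using the --enable_each_rank_log flag that appears in the runner command above (the ./rank_logs directory name is only an example, and the exact flag behavior is an assumption based on that log line):

# Decode the negative return code: a subprocess reports -N when killed by signal N.
python -c "import signal; print(signal.Signals(8).name)"   # prints SIGFPE on Linux

# Hypothetical: capture each rank's stdout/stderr in its own file so the crashing
# rank's traceback is not lost, e.g. by editing the deepspeed call in pretrain.sh:
#   deepspeed --enable_each_rank_log ./rank_logs llava/train/train_mem.py ...
#   (remaining arguments unchanged from the script above)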
