Jeremy-lf changed the title from "Issue about pretraining [ERROR] [launch.py:321:sigkill_handler], anyone can help me?" to "Issue about pretraining [return code = -8], anyone can help me?" on May 9, 2024.
Question
When I run the pretraining stage, I hit the following problem. There is no obvious error message, so how should I debug it?
Environment: 8× A800, CUDA 11.6
My training script:
sh scripts/v1_5/pretrain.sh
The error output:
[2024-05-09 15:33:07,293] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:09,662] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-09 15:33:09,662] [INFO] [runner.py:571:main] cmd = /root/paddlejob/workspace/env_run/anaconda3/envs/llava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed ./scripts/zero2.json --model_name_or_path vicuna-7b-v1.3 --version plain --data_path /root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/blip_laion_cc_sbu_558k.json --image_folder /root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/images --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir ./checkpoints/llava-v1.5-13b-pretrain --num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 24000 --save_total_limit 1 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True
[2024-05-09 15:33:10,897] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_GID_INDEX=3
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_CONNECT_RETRY_CNT=15
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=22
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.7.8
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_CUDA_SUPPORT=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_IB_QPS_PER_CONNECTION=8
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=INFO
[2024-05-09 15:33:13,106] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=xgbe0
[2024-05-09 15:33:13,106] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-05-09 15:33:13,106] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-05-09 15:33:13,106] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-05-09 15:33:13,106] [INFO] [launch.py:163:main] dist_world_size=8
[2024-05-09 15:33:13,106] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-05-09 15:33:17,512] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,649] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,692] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,761] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,771] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,835] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,852] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:17,948] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,948] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-05-09 15:33:17,956] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:17,992] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:18,037] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:18,046] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:18,048] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 15:33:18,178] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:18,234] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 15:33:18,239] [INFO] [comm.py:637:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.57s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Formatting inputs...Skip in lazy mode
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
wandb: Tracking run with wandb version 0.16.4
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
[2024-05-09 15:34:25,193] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83400
[2024-05-09 15:34:25,584] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83401
[2024-05-09 15:34:25,587] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83402
[2024-05-09 15:34:25,590] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83403
[2024-05-09 15:34:26,659] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83404
[2024-05-09 15:34:26,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83405
[2024-05-09 15:34:26,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83406
[2024-05-09 15:34:26,664] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 83407
[2024-05-09 15:34:26,666] [ERROR] [launch.py:321:sigkill_handler] ['/root/paddlejob/workspace/env_run/anaconda3/envs/llava/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=7', '--deepspeed', './scripts/zero2.json', '--model_name_or_path', 'vicuna-7b-v1.3', '--version', 'plain', '--data_path', '/root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/blip_laion_cc_sbu_558k.json', '--image_folder', '/root/paddlejob/workspace/env_run/lvfeng/llava/LLaVA/data/llava-v1.5-7b/images', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--tune_mm_mlp_adapter', 'True', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--bf16', 'True', '--output_dir', './checkpoints/llava-v1.5-13b-pretrain', '--num_train_epochs', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '24000', '--save_total_limit', '1', '--learning_rate', '1e-3', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True'] exits with return code = -8
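For context on the exit status: DeepSpeed's launcher reports `exits with return code = -N` using Python's `subprocess.Popen.returncode` convention, where a negative value means the worker process was killed by signal N. A minimal sketch to decode the `-8` seen above (the `rc` value is just copied from the log):

```python
import signal

# Value copied from the launcher log: "exits with return code = -8".
# subprocess.Popen.returncode uses -N to mean "terminated by signal N".
rc = -8

# Map the signal number back to its name (POSIX/Linux numbering).
sig = signal.Signals(-rc)
print(sig.name)  # SIGFPE: the worker died on an arithmetic fault
```

So the workers were killed by SIGFPE rather than exiting with a Python traceback, which is why there is no obvious error message. One thing worth cross-checking, though this is an assumption rather than something the log confirms, is a version mismatch: the log prints "NCCL version 2.19.3+cuda12.3" while the stated environment is CUDA 11.6.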