Yi-34B: loading the model for two-GPU DeepSpeed ZeRO-2 training uses more than 200 GB of CPU memory and fails for lack of memory #3832

xxll88 · 2024-05-20T15:44:55Z

Reminder

I have read the README and searched the existing issues.

Reproduction

### model
model_name_or_path: /home/ubuntu/Yi-1.5-34B

### method
stage: pt
do_train: true
finetuning_type: freeze
template: default

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: qclound,intlcloud
cutoff_len: 1024
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: ../saves/Yi-1.5-34B/ptqcloud2
save_total_limit: 1
logging_steps: 20
save_steps: 1000
plot_loss: true
overwrite_output_dir: false

### train
per_device_train_batch_size: 16 #16
gradient_accumulation_steps: 1 #2
learning_rate: 0.0001
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true #bf16

### eval
val_size: 0.001
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

examples/deepspeed/ds_z2_config.json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}
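For context on the ">200G" figure in the title, here is a rough back-of-the-envelope sketch of host RAM during model loading. It rests on assumptions that are not stated in this issue: that each of the two training processes loads its own full copy of the weights into CPU memory before moving them to the GPU (ZeRO-2 partitions optimizer state and gradients, not the parameters), and that the model has a nominal 34 billion parameters. The helper `host_ram_gib` is hypothetical and only does the arithmetic; real usage also depends on the loader, checkpoint format, and library versions.

```python
def host_ram_gib(num_params: float, bytes_per_param: int, num_processes: int) -> float:
    """Estimate peak host RAM (GiB) if every process holds a full weight copy."""
    total_bytes = num_params * bytes_per_param * num_processes
    return total_bytes / 1024**3


# Assumption: nominal parameter count for a "34B" model.
NUM_PARAMS = 34e9
# Two GPUs in the issue, so two processes.
NUM_PROCESSES = 2

for label, nbytes in (("bf16", 2), ("fp32", 4)):
    estimate = host_ram_gib(NUM_PARAMS, nbytes, NUM_PROCESSES)
    print(f"{label}: about {estimate:.0f} GiB across {NUM_PROCESSES} processes")
```

Under these assumptions, weights held in fp32 on the CPU in both processes would by themselves exceed 200 GiB, while bf16 copies would stay below it. That is consistent with the reported symptom, but the thread itself does not confirm the cause.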

Expected behavior

System Info

No response

Others

No response

hiyouga · 2024-05-20T16:19:03Z

Update to the latest version of the code.

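xxll88 changed the title ~~Yi-34B: loading the model for DeepSpeed ZeRO training fails due to insufficient CPU memory~~ Yi-34B: loading the model for two-GPU DeepSpeed ZeRO-2 training fails due to insufficient CPU memory May 20, 2024

xxll88 changed the title ~~Yi-34B: loading the model for two-GPU DeepSpeed ZeRO-2 training fails due to insufficient CPU memory~~ Yi-34B: loading the model for two-GPU DeepSpeed ZeRO-2 training uses more than 200 GB of CPU memory and fails for lack of memory May 20, 2024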

hiyouga added pending This problem is yet to be addressed. labels May 20, 2024

xxll88 closed this as completed May 20, 2024

hiyouga added solved This problem has been already solved. and removed pending This problem is yet to be addressed. labels May 21, 2024
