
[Bug] Distributed training code example throws an error #1540

Closed · 2 tasks done
apachemycat opened this issue May 7, 2024 · 5 comments
Labels: bug (Something isn't working)

@apachemycat
Prerequisite

Environment

PyTorch 2.3, CUDA 12.3, GPU training

Reproduces the problem - code sample

https://github.com/open-mmlab/mmengine/blob/main/examples/llama2/fsdp_finetune.py
Modified to fine-tune the InternLM (书生) model:
# Prepare model for internlm2 by wuzhhui
model, tokenizer = build_model(
    model_name_or_path=args.checkpoint,
    return_tokenizer=True)

# Prepare model for llama
# tokenizer = LlamaTokenizer.from_pretrained(args.checkpoint)
# tokenizer.add_special_tokens({'pad_token': '<PAD>'})
# model = LlamaForCausalLM.from_pretrained(args.checkpoint)
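
For reference, a minimal sketch of loading InternLM2 without the custom build_model helper, using the generic Hugging Face Auto classes; the pad-token fallback below is an assumption, not part of the original example:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    args.checkpoint, trust_remote_code=True)
if tokenizer.pad_token is None:
    # Reuse the eos token as padding when the checkpoint defines no pad token.
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    args.checkpoint, trust_remote_code=True)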

Reproduces the problem - command or script

LOGLEVEL=DEBUG NPROC_PER_NODE=1 torchrun fsdp_finetune.py /models/instruct-finetrain.json /models/internlm2-1_8b --max-epoch 100 --save-interval 50 --output-dir ${work_dir}

Reproduces the problem - error message

RuntimeError: "amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
[rank0]: Traceback (most recent call last):
[rank0]: File "/models/internlm2-1_8b_fsdp_train/fsdp_finetune.py", line 185, in
[rank0]: train()
[rank0]: File "/models/internlm2-1_8b_fsdp_train/fsdp_finetune.py", line 161, in train
[rank0]: optimizer.update_params(loss)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 201, in update_params
[rank0]: self.step(**step_kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/scheduler/param_scheduler.py", line 115, in wrapper
[rank0]: return wrapped(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/optimizer/amp_optimizer_wrapper.py", line 137, in step
[rank0]: self.loss_scaler.unscale_(self.optimizer)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/sharded_grad_scaler.py", line 278, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/sharded_grad_scaler.py", line 243, in _unscale_grads_
[rank0]: torch._amp_foreach_non_finite_check_and_unscale_(
[rank0]: RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'

Additional information

Maybe it is an issue with the model or with the library?

@apachemycat added the bug (Something isn't working) label on May 7, 2024
@zhouzaida
Member

What GPU model are you using?

@zhouzaida
Member

It is possible that your GPU does not support bfloat16 computation.
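
A quick way to check this on the training node (a minimal sketch, assuming a single-GPU setup):

import torch

print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # native bf16 needs compute capability >= (8, 0)
print(torch.cuda.is_bf16_supported())       # PyTorch's own bfloat16 capability check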

@zhouzaida
Member

Also, if you want to fine-tune InternLM models, XTuner (https://github.com/InternLM/xtuner) is recommended.

@apachemycat
Author

GPU 1: Tesla V100-PCIE-32GB

@zhouzaida
Member

> GPU 1: Tesla V100-PCIE-32GB

The V100 most likely does not support bfloat16.
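
If switching hardware is not an option, a possible workaround is to fall back to float16 on pre-Ampere GPUs. Below is a minimal sketch at the plain PyTorch FSDP level; where exactly the policy plugs into fsdp_finetune.py, and the matching dtype for the AMP optimizer wrapper, are assumptions rather than the example's actual configuration:

import torch
from torch.distributed.fsdp import MixedPrecision

# float16 runs on Volta (V100) but needs gradient scaling, which the
# sharded grad scaler shown in the traceback already provides.
use_bf16 = torch.cuda.get_device_capability(0) >= (8, 0)
dtype = torch.bfloat16 if use_bf16 else torch.float16

# Hypothetical: pass this policy wherever the example builds its FSDP wrapper.
mp_policy = MixedPrecision(
    param_dtype=dtype,
    reduce_dtype=dtype,
    buffer_dtype=dtype,
)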
