
errors while finetuning internlm2-chat-20b with qlora #3798

Open · 1 task done
a1exyu opened this issue May 17, 2024 · 1 comment
Labels
pending This problem is yet to be addressed.

Comments

a1exyu commented May 17, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

CUDA_VISIBLE_DEVICES=1 llamafactory-cli example/......
Below is the yaml file I used (a standalone Python sketch of the same setup follows it):

### model

model_name_or_path: /home/ybh/ybh/models/internlm2-chat-20b
quantization_bit: 4

### method

stage: sft
do_train: true
finetuning_type: lora
lora_target: wqkv

### dataset

dataset: text_classification_coarse
template: intern2
cutoff_len: 6144
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output

output_dir: /home/ybh/ybh/nlpcc/LLaMA-Factory/saves/internlm2-chat-20b/qlora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval

val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 10
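
For reference, here is a minimal standalone Python sketch of the same QLoRA setup outside llamafactory-cli. The model path and lora_target come from the yaml above; the rank, alpha, quant type, and compute dtype are illustrative assumptions, not the values LLaMA-Factory uses internally.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_path = "/home/ybh/ybh/models/internlm2-chat-20b"

# quantization_bit: 4 -> load the base weights in 4 bit via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed quant type
    bnb_4bit_compute_dtype=torch.float16,   # matches fp16: true
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    trust_remote_code=True,                 # InternLM2 ships custom modeling code
    torch_dtype=torch.float16,
)

# Standard QLoRA preparation (casts norms, enables grads on the embedding output
# when gradient checkpointing is used).
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# finetuning_type: lora, lora_target: wqkv -> adapters on the fused attention projection
lora_config = LoraConfig(
    r=8,                                    # assumed rank, not taken from the yaml
    lora_alpha=16,                          # assumed alpha, not taken from the yaml
    target_modules=["wqkv"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # prints how many parameters the LoRA adapters add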

Expected behavior

No response

System Info

[INFO|trainer.py:2048] 2024-05-18 00:07:10,006 >> ***** Running training *****
[INFO|trainer.py:2049] 2024-05-18 00:07:10,006 >> Num examples = 122
[INFO|trainer.py:2050] 2024-05-18 00:07:10,006 >> Num Epochs = 5
[INFO|trainer.py:2051] 2024-05-18 00:07:10,006 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2054] 2024-05-18 00:07:10,006 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2055] 2024-05-18 00:07:10,006 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2056] 2024-05-18 00:07:10,006 >> Total optimization steps = 75
[INFO|trainer.py:2057] 2024-05-18 00:07:10,007 >> Number of trainable parameters = 2,621,440
0%| | 0/75 [00:00<?, ?it/s]/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
Traceback (most recent call last):
File "/home/ybh/miniconda3/envs/nlpcc/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/cli.py", line 65, in main
run_exp()
File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 33, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
return inner_training_loop(
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
self.accelerator.backward(loss)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/accelerate/accelerator.py", line 2121, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/init.py", line 267, in backward
_engine_run_backward(
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
0%| | 0/75 [00:00<?, ?it/s]
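
For context, this RuntimeError is the generic error PyTorch raises when backward() is called on a tensor that is not connected to anything with requires_grad=True; the earlier warning "None of the inputs have requires_grad=True" points the same way. A trivial illustration of the same failure, unrelated to LLaMA-Factory:

import torch

x = torch.randn(3)    # requires_grad defaults to False
loss = (x * 2).sum()  # loss has no grad_fn because no input requires grad
loss.backward()       # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn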

Others

I switched to fine-tuning internlm-chat-7b with LoRA, and this error did not happen.
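
One hypothesis worth testing, not confirmed as the fix here: since both warnings above come from gradient checkpointing, force the embedding output to require grad and use the non-reentrant checkpoint variant at the transformers level. A minimal sketch, assuming `model` is the PEFT-wrapped model from the sketch above and reusing the output_dir from the yaml:

from transformers import TrainingArguments

# Untested workaround idea: with a 4-bit base model plus gradient checkpointing,
# the inputs to the checkpointed blocks can end up with requires_grad=False,
# which produces exactly this error.
model.enable_input_require_grads()  # make embedding outputs require grad

training_args = TrainingArguments(
    output_dir="/home/ybh/ybh/nlpcc/LLaMA-Factory/saves/internlm2-chat-20b/qlora/sft",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # as the first warning suggests
    fp16=True,
)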

@hiyouga added the "pending" label (This problem is yet to be addressed.) on May 17, 2024
@gabriel-peracio

Yes, same here, though in my case I tried it with internlm2-20b (base, non-chat).

The same configuration applied to internlm2-7b appears to work (I did not let it run to completion, as I am not interested in that model).
