
errors while finetuning internlm2-chat-20b with qlora #3798

Open · 1 task done
a1exyu opened this issue May 17, 2024 · 1 comment
Labels
pending This problem is yet to be addressed.

Comments

a1exyu commented May 17, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

CUDA_VISIBLE_DEVICES=1 llamafactory-cli example/......
Below is the yaml file I used (a standalone Python sketch of the same setup follows it):

### model

model_name_or_path: /home/ybh/ybh/models/internlm2-chat-20b
quantization_bit: 4

### method

stage: sft
do_train: true
finetuning_type: lora
lora_target: wqkv

### dataset

dataset: text_classification_coarse
template: intern2
cutoff_len: 6144
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output

output_dir: /home/ybh/ybh/nlpcc/LLaMA-Factory/saves/internlm2-chat-20b/qlora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval

val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 10
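
For reference, here is a minimal standalone Python sketch of the same QLoRA setup outside llamafactory-cli. The model path and lora_target come from the yaml above; the rank, alpha, quant type, and compute dtype are illustrative assumptions, not the values LLaMA-Factory uses internally.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_path = "/home/ybh/ybh/models/internlm2-chat-20b"

# quantization_bit: 4 -> load the base weights in 4 bit via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed quant type
    bnb_4bit_compute_dtype=torch.float16,   # matches fp16: true
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    trust_remote_code=True,                 # InternLM2 ships custom modeling code
    torch_dtype=torch.float16,
)

# Standard QLoRA preparation (casts norms, enables grads on the embedding output
# when gradient checkpointing is used).
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# finetuning_type: lora, lora_target: wqkv -> adapters on the fused attention projection
lora_config = LoraConfig(
    r=8,                                    # assumed rank, not taken from the yaml
    lora_alpha=16,                          # assumed alpha, not taken from the yaml
    target_modules=["wqkv"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # prints how many parameters the LoRA adapters add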

Expected behavior

No response

System Info

[INFO|trainer.py:2048] 2024-05-18 00:07:10,006 >> ***** Running training *****
[INFO|trainer.py:2049] 2024-05-18 00:07:10,006 >> Num examples = 122
[INFO|trainer.py:2050] 2024-05-18 00:07:10,006 >> Num Epochs = 5
[INFO|trainer.py:2051] 2024-05-18 00:07:10,006 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2054] 2024-05-18 00:07:10,006 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2055] 2024-05-18 00:07:10,006 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2056] 2024-05-18 00:07:10,006 >> Total optimization steps = 75
[INFO|trainer.py:2057] 2024-05-18 00:07:10,007 >> Number of trainable parameters = 2,621,440
0%| | 0/75 [00:00<?, ?it/s]/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
Traceback (most recent call last):
File "/home/ybh/miniconda3/envs/nlpcc/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/cli.py", line 65, in main
run_exp()
File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 33, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
return inner_training_loop(
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
self.accelerator.backward(loss)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/accelerate/accelerator.py", line 2121, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/init.py", line 267, in backward
_engine_run_backward(
File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
0%| | 0/75 [00:00<?, ?it/s]
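
For context, this RuntimeError is the generic error PyTorch raises when backward() is called on a tensor that is not connected to anything with requires_grad=True; the earlier warning "None of the inputs have requires_grad=True" points the same way. A trivial illustration of the same failure, unrelated to LLaMA-Factory:

import torch

x = torch.randn(3)    # requires_grad defaults to False
loss = (x * 2).sum()  # loss has no grad_fn because no input requires grad
loss.backward()       # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn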

Others

I switched to fine-tuning internlm-chat-7b with LoRA, and this error did not happen.
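
One hypothesis worth testing, not confirmed as the fix here: since both warnings above come from gradient checkpointing, force the embedding output to require grad and use the non-reentrant checkpoint variant at the transformers level. A minimal sketch, assuming `model` is the PEFT-wrapped model from the sketch above and reusing the output_dir from the yaml:

from transformers import TrainingArguments

# Untested workaround idea: with a 4-bit base model plus gradient checkpointing,
# the inputs to the checkpointed blocks can end up with requires_grad=False,
# which produces exactly this error.
model.enable_input_require_grads()  # make embedding outputs require grad

training_args = TrainingArguments(
    output_dir="/home/ybh/ybh/nlpcc/LLaMA-Factory/saves/internlm2-chat-20b/qlora/sft",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # as the first warning suggests
    fp16=True,
)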

@hiyouga added the "pending" label (This problem is yet to be addressed.) on May 17, 2024
@gabriel-peracio

Yes, same here, though in my case I tried it with internlm2-20b (base, non-chat).

The same configuration applied to internlm2-7b appears to work (I did not let it run to completion, as I am not interested in that model).
