
[Usage] errors when restore checkpoint using lora finetuning #1492

Open
wenyisir opened this issue May 7, 2024 · 5 comments
Comments

wenyisir commented May 7, 2024

Describe the issue

Issue:
When resuming training from a checkpoint with LoRA fine-tuning enabled, an error occurs; resuming without LoRA fine-tuning works fine. Can you explain why? How should I modify the code so that more parameters are saved?

Command:

/home/wyxu/miniconda3/envs/llava/bin/deepspeed --master_port 25675 \
          --include localhost:3,4,5,6 \
          /home/wyxu/LLaVA/llava/train/train_mem.py \
          --lora_enable True \
          --deepspeed /home/wyxu/LLaVA/scripts/zero2.json \
          --model_name_or_path /data/wyxu/LLaVA/checkpoints/vicuna-7b-v1.3 \
          --version v1 \
          --data_path /data/wyxu/MIC_sampled/data/ \
          --image_folder /data/wyxu/MIC_sampled/data/ \
          --vision_tower /data/wyxu/LLaVA/checkpoints/clip-vit-large-patch14 \
          --pretrain_mm_mlp_adapter /data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-pretrain/mm_projector.bin \
          --mm_vision_select_layer -2 \
          --mm_use_im_start_end False \
          --mm_use_im_patch_token False \
          --bf16 True \
          --output_dir /data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-finetune-on-mic_sampled-lora \
          --num_train_epochs 10 \
          --per_device_train_batch_size 4 \
          --per_device_eval_batch_size 1 \
          --gradient_accumulation_steps 4 \
          --evaluation_strategy no \
          --save_strategy steps \
          --save_steps 90 \
          --save_total_limit 1 \
          --learning_rate 2e-5 \
          --weight_decay 0. \
          --warmup_ratio 0.03 \
          --lr_scheduler_type cosine \
          --logging_steps 1 \
          --tf32 True \
          --model_max_length 2048 \
          --gradient_checkpointing True \
          --dataloader_num_workers 4 \
          --lazy_preprocess True \
          --report_to wandb

Log:

Traceback (most recent call last):
  File "/home/wyxu/LLaVA/llava/train/train_mem.py", line 4, in <module>
    train(attn_implementation="flash_attention_2")
  File "/home/wyxu/LLaVA/llava/train/train.py", line 1037, in train
    trainer.train(resume_from_checkpoint=True)
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2724, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2794, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2587, in load_module_state_dict
    self.module.load_state_dict(
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.up_proj.base_layer.weight"
......


1uciusy commented May 8, 2024

With resume_from_checkpoint=True, the trainer checks your --output_dir folder (/data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-finetune-on-mic_sampled-lora) for the latest checkpoint-xxxx and resumes training from it.
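
A minimal sketch of that lookup, assuming the standard transformers helper (the path is taken from the command above):

from transformers.trainer_utils import get_last_checkpoint

output_dir = "/data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-finetune-on-mic_sampled-lora"
# Returns the newest "checkpoint-<step>" subdirectory, or None if none exists.
print(get_last_checkpoint(output_dir))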


1uciusy commented May 8, 2024

As for the mismatch of the state_dict, pinning these versions may help:

pip install transformers==4.39.3
pip install accelerate==0.27.2

It is mentioned in some other issues, but I forgot which one.
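
For context, a minimal, hypothetical sketch (a tiny stand-in model, not the LLaVA code) of why the base weights go missing: under LoRA only the adapter parameters are trainable, so a checkpoint that stores only trainable parameters omits the frozen base_layer weights that a strict load_state_dict expects.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Tiny stand-in model; fan_in_fan_out=True because GPT-2's c_attn is a Conv1D layer.
base = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
lora = get_peft_model(base, LoraConfig(r=8, target_modules=["c_attn"], fan_in_fan_out=True))

params = dict(lora.named_parameters())
trainable = [n for n, p in params.items() if p.requires_grad]
# Only the lora_A / lora_B adapter tensors require grad; the base weights are frozen.
print(f"{len(trainable)} trainable of {len(params)} parameters")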

tetsu-kikuchi commented

Might be this one: #1200

wenyisir commented May 9, 2024

I fixed this bug by modifying site-packages/deepspeed/runtime/engine.py: at line 2675, set load_module_strict=False.
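
For reference, a rough sketch of the same idea without editing site-packages: DeepSpeed's engine.load_checkpoint accepts load_module_strict as a keyword argument, so a non-strict load can be requested at the call site. Here `engine` and `ckpt_dir` are hypothetical names for an initialized DeepSpeed engine and the checkpoint directory.

# Non-strict load tolerates the adapter-only state_dict, i.e. the missing
# frozen base_layer keys from the LoRA checkpoint are skipped instead of
# raising a RuntimeError.
load_path, client_state = engine.load_checkpoint(
    ckpt_dir,
    load_module_strict=False,
)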


1uciusy commented May 9, 2024

Great, so there is no need to change the transformers version, and you can avoid potential troubles like those in #1218 when doing inference.
