RuntimeError: mat1 and mat2 must have the same dtype #63

Open
Crystalxd opened this issue Mar 18, 2024 · 0 comments

Comments

@Crystalxd
Logs from running inference with a custom-trained model (LanguageBind image tower + Qwen-14B LLM):

(moellava) root@ps:/code/MoE-LLaVA# CUDA_VISIBLE_DEVICES=0 python predict.py 
[2024-03-18 02:02:14,276] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
  warnings.warn(
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
  warnings.warn(
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
projector_type: mlp2x_gelu
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00,  4.07s/it]
Some weights of the model checkpoint at /output/llava-qwen14/checkpoint-200/ were not used when initializing LlavaQWenForCausalLM: ['transformer.image_tower.image_tower.embeddings.class_embedding', 'transformer.image_tower.image_tower.embeddings.patch_embedding.weight', 'transformer.image_tower.image_tower.embeddings.position_embedding.weight', 'transformer.image_tower.image_tower.encoder.layers.0.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.0.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.0.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.0.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.0.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.0.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.0.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.0.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.0.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.1.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.1.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.1.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.1.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.1.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.1.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.1.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.1.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.1.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.10.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.10.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.10.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.10.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.10.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.10.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.10.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.10.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.10.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.10.self_attn.k_proj.weight', 
'transformer.image_tower.image_tower.encoder.layers.10.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.10.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.10.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.10.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.10.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.10.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.11.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.11.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.11.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.11.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.11.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.11.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.11.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.11.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.11.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.12.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.12.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.12.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.12.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.12.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.12.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.12.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.12.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.12.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.13.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.13.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.13.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.13.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.13.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.13.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.13.mlp.fc2.weight', 
'transformer.image_tower.image_tower.encoder.layers.13.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.13.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.13.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.13.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.13.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.13.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.13.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.13.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.14.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.14.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.14.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.14.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.14.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.14.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.14.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.14.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.14.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.15.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.15.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.15.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.15.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.15.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.15.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.15.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.15.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.15.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.16.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.16.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.16.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.16.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.16.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.16.mlp.fc1.weight', 
'transformer.image_tower.image_tower.encoder.layers.16.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.16.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.16.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.17.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.17.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.17.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.17.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.17.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.17.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.17.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.17.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.17.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.18.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.18.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.18.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.18.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.18.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.18.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.18.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.18.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.18.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.19.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.19.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.19.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.19.layer_norm2.weight', 
'transformer.image_tower.image_tower.encoder.layers.19.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.19.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.19.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.19.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.2.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.2.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.2.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.2.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.2.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.2.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.2.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.2.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.2.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.20.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.20.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.20.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.20.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.20.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.20.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.20.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.20.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.20.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.21.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.21.layer_norm1.weight', 
'transformer.image_tower.image_tower.encoder.layers.21.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.21.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.21.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.21.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.21.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.21.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.21.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.22.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.22.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.22.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.22.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.22.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.22.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.22.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.22.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.22.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.23.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.23.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.23.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.23.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.23.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.23.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.23.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.23.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.23.self_attn.v_proj.weight', 
'transformer.image_tower.image_tower.encoder.layers.3.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.3.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.3.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.3.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.3.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.3.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.3.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.3.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.3.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.4.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.4.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.4.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.4.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.4.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.4.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.4.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.4.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.4.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.5.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.5.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.5.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.5.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.5.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.5.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.5.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.5.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.5.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.5.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.5.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.5.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.5.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.5.self_attn.q_proj.weight', 
'transformer.image_tower.image_tower.encoder.layers.5.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.5.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.6.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.6.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.6.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.6.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.6.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.6.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.6.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.6.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.6.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.7.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.7.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.7.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.7.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.7.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.7.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.7.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.7.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.7.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.8.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.8.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.8.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.8.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.8.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.8.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.8.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.8.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.8.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.8.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.8.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.8.self_attn.out_proj.weight', 
'transformer.image_tower.image_tower.encoder.layers.8.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.8.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.8.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.8.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.9.layer_norm1.bias', 'transformer.image_tower.image_tower.encoder.layers.9.layer_norm1.weight', 'transformer.image_tower.image_tower.encoder.layers.9.layer_norm2.bias', 'transformer.image_tower.image_tower.encoder.layers.9.layer_norm2.weight', 'transformer.image_tower.image_tower.encoder.layers.9.mlp.fc1.bias', 'transformer.image_tower.image_tower.encoder.layers.9.mlp.fc1.weight', 'transformer.image_tower.image_tower.encoder.layers.9.mlp.fc2.bias', 'transformer.image_tower.image_tower.encoder.layers.9.mlp.fc2.weight', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.k_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.k_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.out_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.out_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.q_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.q_proj.weight', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.v_proj.bias', 'transformer.image_tower.image_tower.encoder.layers.9.self_attn.v_proj.weight', 'transformer.image_tower.image_tower.post_layernorm.bias', 'transformer.image_tower.image_tower.post_layernorm.weight', 'transformer.image_tower.image_tower.pre_layrnorm.bias', 'transformer.image_tower.image_tower.pre_layrnorm.weight']
- This IS expected if you are initializing LlavaQWenForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlavaQWenForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
LlavaQWenForCausalLM(
  (transformer): LlavaQWenModel(
    (wte): Embedding(152064, 5120)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-39): 40 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=5120, out_features=15360, bias=True)
          (c_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=5120, out_features=13696, bias=False)
          (w2): Linear(in_features=5120, out_features=13696, bias=False)
          (c_proj): Linear(in_features=13696, out_features=5120, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
    (image_tower): LanguageBindImageTower()
    (mm_projector): build_projector(
      (image_spatial_proj): Sequential(
        (0): Linear(in_features=1024, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=5120, bias=True)
      )
      (video_patch_proj): Identity()
      (video_spatial_proj): Identity()
      (video_temproal_proj): Identity()
      (video_global_proj): Identity()
    )
  )
  (lm_head): Linear(in_features=5120, out_features=152064, bias=False)
)
/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
ASSISTANT: What is the man in the picture doing?
2024-03-18 02:02:45.831 | WARNING  | __main__:main:38 - ==================
tensor([[    32,   6236,   1948,    264,  22208,   1196,    323,    458,  20443,
          11229,  17847,     13,    576,  17847,   6696,  10950,     11,  11682,
             11,    323,  47787,  11253,    311,    279,   1196,    594,   4755,
             13,  13872,     25,    220,   -200,    198,  45930, 101047, 102015,
          18493, 106428,     30,  35560,   3846,   2821,     25]],
       device='cuda:0')
tensor([[[[-1.7891, -1.7891, -1.7891,  ..., -1.7930, -1.7930, -1.7891],
          [-1.7627, -1.7676, -1.7627,  ..., -1.7686, -1.7559, -1.7637],
          [-1.7461, -1.7480, -1.7471,  ..., -1.7490, -1.7363, -1.7520],
          ...,
          [-1.7285, -1.7344, -1.6748,  ..., -1.7461, -1.7266, -1.7402],
          [-1.7686, -1.7510, -1.7715,  ..., -1.7949, -1.7402, -1.7734],
          [-1.7832, -1.7910, -1.7852,  ..., -1.7891, -1.7900, -1.7930]],

         [[-1.7539, -1.7500, -1.7422,  ..., -1.7412, -1.7432, -1.7539],
          [-1.7002, -1.7051, -1.7021,  ..., -1.7422, -1.7246, -1.7119],
          [-1.6758, -1.6797, -1.6826,  ..., -1.6777, -1.6650, -1.6807],
          ...,
          [-1.6445, -1.6914, -1.6289,  ..., -1.7041, -1.6758, -1.6982],
          [-1.7168, -1.7119, -1.7451,  ..., -1.7383, -1.7158, -1.7324],
          [-1.7432, -1.7471, -1.7451,  ..., -1.7490, -1.7529, -1.7510]],

         [[-1.4814, -1.4814, -1.4814,  ..., -1.4834, -1.4834, -1.4834],
          [-1.4434, -1.4473, -1.4463,  ..., -1.4082, -1.3926, -1.3936],
          [-1.3838, -1.3867, -1.3867,  ..., -1.3975, -1.3994, -1.3994],
          ...,
          [-1.4268, -1.4434, -1.3691,  ..., -1.4326, -1.4082, -1.4443],
          [-1.4775, -1.4551, -1.4697,  ..., -1.4756, -1.4756, -1.4590],
          [-1.4629, -1.4678, -1.4668,  ..., -1.4697, -1.4736, -1.4805]]]],
       device='cuda:0', dtype=torch.float16)
2024-03-18 02:02:45.843 | WARNING  | __main__:main:41 - ==================
++++++++++++++++
tensor([[[-0.9653,  0.5757, -1.1807,  ...,  1.2803, -0.7188, -0.8818],
         [ 0.4961,  2.7051,  0.0115,  ...,  0.6382, -0.6060, -0.5703],
         [ 0.1807,  0.8447,  0.4824,  ...,  1.0771, -0.0136, -1.2354],
         ...,
         [ 0.6416,  1.0879, -0.5303,  ...,  1.1309, -0.9102, -0.0253],
         [ 0.3801,  3.1152, -0.9663,  ..., -0.0643, -0.4917,  1.3672],
         [-0.8354,  0.7363, -1.6709,  ...,  1.4736, -0.3210, -0.8779]]],
       device='cuda:0', dtype=torch.float16)
image_feature_shape: torch.Size([1, 256, 1024])
Traceback (most recent call last):
  File "/code/MoE-LLaVA/predict.py", line 57, in <module>
    main()
  File "/code/MoE-LLaVA/predict.py", line 44, in main
    output_ids = model.generate(
  File "/code/MoE-LLaVA/moellava/model/language_model/qwen/modeling_qwen.py", line 1260, in generate
    return super().generate(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 1520, in generate
    return self.sample(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2617, in sample
    outputs = self(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/code/MoE-LLaVA/moellava/model/language_model/llava_qwen.py", line 147, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/code/MoE-LLaVA/moellava/model/llava_arch.py", line 458, in prepare_inputs_labels_for_multimodal
    image_features_minibatch = self.encode_images(images_minibatch)  # [mini_b, l, c]
  File "/code/MoE-LLaVA/moellava/model/llava_arch.py", line 155, in encode_images
    image_features = self.get_model().mm_projector.forward_image(image_features)
  File "/code/MoE-LLaVA/moellava/model/multimodal_projector/builder.py", line 140, in forward_image
    return self.image_spatial_proj(image_feature)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype
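
The traceback ends in F.linear inside the mm_projector's image_spatial_proj. Note that the load log above reports the model being automatically converted to bf16, while predict.py (below) casts the image tensor to torch.float16, which would produce exactly this fp16-vs-bf16 mismatch at the projector. A minimal diagnostic sketch, assuming the model and image_tensor objects from predict.py, to confirm where the dtypes diverge before calling generate():

# Hedged diagnostic sketch; attribute names are taken from the model printout and
# traceback above (get_model(), mm_projector, image_spatial_proj).
proj = model.get_model().mm_projector.image_spatial_proj[0]    # first Linear of mlp2x_gelu
print("projector weight dtype:", proj.weight.dtype)            # expected torch.bfloat16 per the load log
print("image tensor dtype:    ", image_tensor.dtype)           # torch.float16 as cast in predict.py
print("LLM parameter dtype:   ", next(model.parameters()).dtype)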

pretrain.sh

JSON_FOLDER="/data/llava_pt/json"
IMAGE_FOLDER="/data"
# cd ~/MoE-LLaVA
# HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1
CUDA_VISIBLE_DEVICES=0,1,2,3 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path /model/Qwen-14B \
    --version plain \
    --data_path ${JSON_FOLDER}/llava_image_.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower /model/LanguageBind/LanguageBind_Image \
    --image_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir /output/llavaqwen-14b-pretrain \
    --num_train_epochs 1.5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"

predict.py

import torch
from PIL import Image
from moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from moellava.conversation import conv_templates, SeparatorStyle
from moellava.model.builder import load_pretrained_model
from moellava.utils import disable_torch_init
from moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from loguru import logger as log

def main():
    disable_torch_init()
    # image = 'moellava/serve/examples/extreme_ironing.jpg'
    # inp = 'What is unusual about this image?'
    image = '/data/lrv_tune/images/2371990.jpg'
    inp = 'What is the man in the picture doing?'
    model_path = '/output/llava-qwen14/checkpoint-200/'  # choose a model
    device = 'cuda'
    load_4bit, load_8bit = False, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
    image_processor = processor['image']
    conv_mode = "qwen"  # phi or qwen or stablelm
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    image_tensor = image_processor.preprocess(Image.open(image).convert('RGB'), return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)


    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    log.warning("==================")
    print(input_ids)
    print(image_tensor)
    log.warning("==================")

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
    print(outputs)

if __name__ == '__main__':
    main()
    # Usage: deepspeed predict.py
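
If the dtypes do diverge as suspected, one possible workaround (an assumption, not a verified fix) is to cast the image tensor to the projector's own dtype instead of hard-coding torch.float16, so both operands of the failing matmul agree:

# Sketch of a dtype-aligned preprocessing call; it would replace the hard-coded
# dtype=torch.float16 line in predict.py. `model`, `image_processor`, and
# `image` are the objects already defined there.
target_dtype = next(model.get_model().mm_projector.parameters()).dtype  # bf16 per the load log
image_tensor = image_processor.preprocess(
    Image.open(image).convert('RGB'), return_tensors='pt'
)['pixel_values'].to(model.device, dtype=target_dtype)

Alternatively, loading the checkpoint with an explicit precision setting (the load log suggests adding bf16/fp16/fp32=True to AutoModelForCausalLM.from_pretrained) should keep the LLM, projector, and image tensor consistent.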

finetune.sh

#!/bin/bash

JSON_FOLDER="/data/lrv_tune/json"
IMAGE_FOLDER="/data"
cd /code/MoE-LLaVA
CUDA_VISIBLE_DEVICES=0,1,2,3 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
    --deepspeed ./scripts/zero2_offload.json \
    --model_name_or_path /model/Qwen-14B \
    --version qwen \
    --data_path ${JSON_FOLDER}/chinese_lrv_tune_50k.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower /model/LanguageBind/LanguageBind_Image \
    --image_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter /output/llavaqwen-14b-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /output/llava-qwen14 \
    --num_train_epochs 2.3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"