Inference logs from a custom-trained model (LanguageBind image tower + Qwen-14B LLM):
(moellava) root@ps:/code/MoE-LLaVA# CUDA_VISIBLE_DEVICES=0 python predict.py
[2024-03-18 02:02:14,276] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
  warnings.warn(
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
  warnings.warn(
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
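The torchvision deprecation warnings above are informational and do not affect inference. If they clutter the logs, they can be filtered with the standard-library warnings module; a minimal sketch (the module regex is an assumption based on the warning locations, and this would need to run before the torchvision imports in predict.py):

```python
import warnings

# Silence the torchvision deprecation UserWarnings seen in the log above.
# Assumption: all three warnings originate under torchvision.transforms.
warnings.filterwarnings(
    "ignore",
    category=UserWarning,
    module=r"torchvision\.transforms\..*",
)

# Mechanism check with a generic ignore filter and a stand-in warning:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=UserWarning)
    warnings.warn("deprecated module", UserWarning)

print(len(caught))  # 0 - the UserWarning was suppressed
```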
projector_type: mlp2x_gelu
Loading checkpoint shards: 100%|██████████| 6/6 [00:24<00:00, 4.07s/it]
Some weights of the model checkpoint at /output/llava-qwen14/checkpoint-200/ were not used when initializing LlavaQWenForCausalLM:
['transformer.image_tower.image_tower.embeddings.class_embedding',
 'transformer.image_tower.image_tower.embeddings.patch_embedding.weight',
 'transformer.image_tower.image_tower.embeddings.position_embedding.weight',
 'transformer.image_tower.image_tower.encoder.layers.0.layer_norm1.bias',
 'transformer.image_tower.image_tower.encoder.layers.0.layer_norm1.weight',
 ... (the list continues with the layer_norm1/layer_norm2, mlp.fc1/mlp.fc2, and self_attn q/k/v/out_proj weights and biases for all 24 encoder layers, 0-23) ...
 'transformer.image_tower.image_tower.post_layernorm.bias',
 'transformer.image_tower.image_tower.post_layernorm.weight',
 'transformer.image_tower.image_tower.pre_layrnorm.bias',
 'transformer.image_tower.image_tower.pre_layrnorm.weight']
- This IS expected if you are initializing LlavaQWenForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlavaQWenForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
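Every key in that "not used" list sits under transformer.image_tower.*, i.e. the LanguageBind vision tower, which is presumably re-initialized from its own checkpoint at load time rather than from these saved weights, so the warning looks benign. A hypothetical sketch of separating those keys out of a state dict (function name and the sample keys are illustrative, modeled on the log's naming):

```python
# Illustrative only: drop the image-tower entries from a checkpoint state
# dict, mirroring the set of keys the "weights were not used" warning lists.

def strip_image_tower(state_dict: dict) -> dict:
    """Return a copy of state_dict without 'transformer.image_tower.*' keys."""
    prefix = "transformer.image_tower."
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

# Tiny stand-in checkpoint using key names from the log above.
ckpt = {
    "transformer.wte.weight": "...",
    "transformer.image_tower.image_tower.post_layernorm.bias": "...",
    "transformer.image_tower.image_tower.pre_layrnorm.weight": "...",
    "lm_head.weight": "...",
}
print(sorted(strip_image_tower(ckpt)))
# ['lm_head.weight', 'transformer.wte.weight']
```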
LlavaQWenForCausalLM(
  (transformer): LlavaQWenModel(
    (wte): Embedding(152064, 5120)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-39): 40 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=5120, out_features=15360, bias=True)
          (c_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=5120, out_features=13696, bias=False)
          (w2): Linear(in_features=5120, out_features=13696, bias=False)
          (c_proj): Linear(in_features=13696, out_features=5120, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
    (image_tower): LanguageBindImageTower()
    (mm_projector): build_projector(
      (image_spatial_proj): Sequential(
        (0): Linear(in_features=1024, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=5120, bias=True)
      )
      (video_patch_proj): Identity()
      (video_spatial_proj): Identity()
      (video_temproal_proj): Identity()
      (video_global_proj): Identity()
    )
  )
  (lm_head): Linear(in_features=5120, out_features=152064, bias=False)
)
/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
ASSISTANT: What is the man in the picture doing?
2024-03-18 02:02:45.831 | WARNING | __main__:main:38 - ==================
tensor([[ 32, 6236, 1948, 264, 22208, 1196, 323, 458, 20443, 11229, 17847, 13,
         576, 17847, 6696, 10950, 11, 11682, 11, 323, 47787, 11253, 311, 279,
         1196, 594, 4755, 13, 13872, 25, 220, -200, 198, 45930, 101047,
         102015, 18493, 106428, 30, 35560, 3846, 2821, 25]], device='cuda:0')
tensor([[[[-1.7891, -1.7891, -1.7891, ..., -1.7930, -1.7930, -1.7891],
          [-1.7627, -1.7676, -1.7627, ..., -1.7686, -1.7559, -1.7637],
          [-1.7461, -1.7480, -1.7471, ..., -1.7490, -1.7363, -1.7520],
          ...,
          [-1.7285, -1.7344, -1.6748, ..., -1.7461, -1.7266, -1.7402],
          [-1.7686, -1.7510, -1.7715, ..., -1.7949, -1.7402, -1.7734],
          [-1.7832, -1.7910, -1.7852, ..., -1.7891, -1.7900, -1.7930]],

         [[-1.7539, -1.7500, -1.7422, ..., -1.7412, -1.7432, -1.7539],
          [-1.7002, -1.7051, -1.7021, ..., -1.7422, -1.7246, -1.7119],
          [-1.6758, -1.6797, -1.6826, ..., -1.6777, -1.6650, -1.6807],
          ...,
          [-1.6445, -1.6914, -1.6289, ..., -1.7041, -1.6758, -1.6982],
          [-1.7168, -1.7119, -1.7451, ..., -1.7383, -1.7158, -1.7324],
          [-1.7432, -1.7471, -1.7451, ..., -1.7490, -1.7529, -1.7510]],

         [[-1.4814, -1.4814, -1.4814, ..., -1.4834, -1.4834, -1.4834],
          [-1.4434, -1.4473, -1.4463, ..., -1.4082, -1.3926, -1.3936],
          [-1.3838, -1.3867, -1.3867, ..., -1.3975, -1.3994, -1.3994],
          ...,
          [-1.4268, -1.4434, -1.3691, ..., -1.4326, -1.4082, -1.4443],
          [-1.4775, -1.4551, -1.4697, ..., -1.4756, -1.4756, -1.4590],
          [-1.4629, -1.4678, -1.4668, ..., -1.4697, -1.4736, -1.4805]]]],
       device='cuda:0', dtype=torch.float16)
2024-03-18 02:02:45.843 | WARNING | __main__:main:41 - ==================
++++++++++++++++
tensor([[[-0.9653, 0.5757, -1.1807, ..., 1.2803, -0.7188, -0.8818],
         [ 0.4961, 2.7051, 0.0115, ..., 0.6382, -0.6060, -0.5703],
         [ 0.1807, 0.8447, 0.4824, ..., 1.0771, -0.0136, -1.2354],
         ...,
         [ 0.6416, 1.0879, -0.5303, ..., 1.1309, -0.9102, -0.0253],
         [ 0.3801, 3.1152, -0.9663, ..., -0.0643, -0.4917, 1.3672],
         [-0.8354, 0.7363, -1.6709, ..., 1.4736, -0.3210, -0.8779]]],
       device='cuda:0', dtype=torch.float16)
image_feature_shape: torch.Size([1, 256, 1024])
Traceback (most recent call last):
  File "/code/MoE-LLaVA/predict.py", line 57, in <module>
    main()
  File "/code/MoE-LLaVA/predict.py", line 44, in main
    output_ids = model.generate(
  File "/code/MoE-LLaVA/moellava/model/language_model/qwen/modeling_qwen.py", line 1260, in generate
    return super().generate(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 1520, in generate
    return self.sample(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2617, in sample
    outputs = self(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/code/MoE-LLaVA/moellava/model/language_model/llava_qwen.py", line 147, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/code/MoE-LLaVA/moellava/model/llava_arch.py", line 458, in prepare_inputs_labels_for_multimodal
    image_features_minibatch = self.encode_images(images_minibatch)  # [mini_b, l, c]
  File "/code/MoE-LLaVA/moellava/model/llava_arch.py", line 155, in encode_images
    image_features = self.get_model().mm_projector.forward_image(image_features)
  File "/code/MoE-LLaVA/moellava/model/multimodal_projector/builder.py", line 140, in forward_image
    return self.image_spatial_proj(image_feature)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype
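The `mat1 and mat2 must have the same dtype` failure is consistent with a precision mismatch: predict.py casts the pixel values to `torch.float16`, while the loader reports "automatically converting to bf16", so the mm_projector's Linear weights and the incoming image features likely end up in different dtypes. A minimal sketch of that failure mode in plain PyTorch (generic layers, not the actual MoE-LLaVA modules; the projector dimensions are copied from the architecture printout above):

```python
import torch
import torch.nn as nn

# A Linear in bf16 (as the checkpoint would be after auto-conversion)
# fed a half-precision input (as predict.py produces).
proj = nn.Linear(1024, 5120).to(torch.bfloat16)
feats = torch.randn(1, 256, 1024, dtype=torch.float16)

try:
    proj(feats)  # mat1 (fp16) vs. mat2 (bf16): dtype mismatch
except RuntimeError as e:
    print("raised:", e)

# Casting the input to the layer's own dtype resolves it.
out = proj(feats.to(proj.weight.dtype))
print(out.dtype)  # torch.bfloat16
```

This is only a reproduction of the error class, not a claim about where exactly the mismatch originates in this checkpoint.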
pretrain.sh
JSON_FOLDER="/data/llava_pt/json"
IMAGE_FOLDER="/data"
# cd ~/MoE-LLaVA
# HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1
CUDA_VISIBLE_DEVICES=0,1,2,3 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path /model/Qwen-14B \
    --version plain \
    --data_path ${JSON_FOLDER}/llava_image_.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower /model/LanguageBind/LanguageBind_Image \
    --image_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir /output/llavaqwen-14b-pretrain \
    --num_train_epochs 1.5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"
predict.py
import torch
from PIL import Image
from moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from moellava.conversation import conv_templates, SeparatorStyle
from moellava.model.builder import load_pretrained_model
from moellava.utils import disable_torch_init
from moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from loguru import logger as log


def main():
    disable_torch_init()
    # image = 'moellava/serve/examples/extreme_ironing.jpg'
    # inp = 'What is unusual about this image?'
    image = '/data/lrv_tune/images/2371990.jpg'
    inp = 'What is the man in the picture doing?'
    model_path = '/output/llava-qwen14/checkpoint-200/'  # choose a model
    device = 'cuda'
    load_4bit, load_8bit = False, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(
        model_path, None, model_name, load_8bit, load_4bit, device=device)
    image_processor = processor['image']
    conv_mode = "qwen"  # phi or qwen or stablelm
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles
    image_tensor = image_processor.preprocess(
        Image.open(image).convert('RGB'),
        return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)
    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    log.warning("==================")
    print(input_ids)
    print(image_tensor)
    log.warning("==================")
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
    print(outputs)


if __name__ == '__main__':
    main()

'''
deepspeed predict.py
'''
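Given the dtype error in the traceback, one candidate fix in predict.py is to stop hard-coding `torch.float16` when moving the pixel values to the GPU, and instead match whatever dtype the checkpoint was actually loaded in (bf16, per the auto-conversion notice in the log). A sketch, where `to_model_dtype` is a hypothetical helper and not part of MoE-LLaVA:

```python
import torch
import torch.nn as nn


def to_model_dtype(tensor: torch.Tensor, model: nn.Module) -> torch.Tensor:
    """Cast a tensor to the device and dtype of the model's parameters.

    Hypothetical helper (not part of MoE-LLaVA): replaces the hard-coded
    `.to(model.device, dtype=torch.float16)` in predict.py so the image
    tensor always matches the precision the checkpoint was loaded in.
    """
    param = next(model.parameters())
    return tensor.to(device=param.device, dtype=param.dtype)


# In predict.py the call site would then look roughly like:
#   image_tensor = to_model_dtype(pixel_values, model)
```

Whether this is sufficient depends on where exactly the bf16/fp16 split happens in this checkpoint; it is one plausible place to start, not a verified fix.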
finetune.sh
#!/bin/bash
JSON_FOLDER="/data/lrv_tune/json"
IMAGE_FOLDER="/data"
cd /code/MoE-LLaVA
CUDA_VISIBLE_DEVICES=0,1,2,3 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
    --deepspeed ./scripts/zero2_offload.json \
    --model_name_or_path /model/Qwen-14B \
    --version qwen \
    --data_path ${JSON_FOLDER}/chinese_lrv_tune_50k.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower /model/LanguageBind/LanguageBind_Image \
    --image_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter /output/llavaqwen-14b-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /output/llava-qwen14 \
    --num_train_epochs 2.3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"