LLaVA++: Extending Visual Capabilities with LLaMA-3 and Phi-3

Hanoona Rasheed*, Muhammad Maaz*, Salman Khan, and Fahad Khan

* Equal contributions

Mohamed bin Zayed University of AI (MBZUAI)

📢 Latest Updates

Apr-30-24- LLaMA-3-V and Phi-3-V demos are now available via Hugging Face Spaces. Check them out at LLaMA-3-V & Phi-3-V 🔥🔥🔥
Apr-28-24- Online demos of Phi-3-V and LLaMA-3-V are released; check them out at Online Demo 🔥🔥🔥
Apr-28-24- LoRA, fully fine-tuned and S² fine-tuned models and results are added! 🔥🔥🔥
Apr-27-24- A Google Colab notebook is released for chatting with the Phi-3-V-3.8B model; check it out at Google Colab 🔥🔥🔥
Apr-26-24- Phi-3-V and LLaVA-3-V released: Excited to release the new integration of LLaVA with Phi-3 Mini Instruct and LLaMA-3 Instruct models! Hugging Face 🔥🔥🔥

💬 Introduction

This repository enhances the capabilities of the LLaVA 1.5 model by incorporating the latest LLMs released this week 🔥: Phi-3 Mini Instruct 3.8B and LLaMA-3 Instruct 8B.

🏆 Results: Phi-3-V and LLaVA-3-V

Comparison on benchmarks for instruction-following LMMs and academic-task-oriented datasets:

The average is computed excluding MME, and second-best results are underlined.

🤖 Model-Zoo

The following table provides an overview of the available models in our zoo. For each model, you can find links to its Hugging Face page.

Model Name	Hugging Face Link	Summary
LLaVA-Phi-3-mini-4k-instruct-pretrain	Hugging Face	Pretrained on LCS-558K.
LLaVA-Phi-3-mini-4k-instruct-lora	Hugging Face	LoRA weights fine-tuned on LLaVA-Instruct-665K.
LLaVA-Phi-3-mini-4k-instruct	Hugging Face	Merged LoRA weights in HuggingFace format.
LLaVA-Phi-3-mini-4k-instruct-FT	Hugging Face	Fully fine-tuned model weights in HuggingFace format.

Model Name	Hugging Face Link	Summary
LLaVA-Meta-Llama-3-8B-Instruct-pretrain	Hugging Face	Pretrained on LCS-558K.
LLaVA-Meta-Llama-3-8B-Instruct-lora	Hugging Face	LoRA weights fine-tuned on LLaVA-Instruct-665K.
LLaVA-Meta-Llama-3-8B-Instruct	Hugging Face	Merged weights in HuggingFace format.
LLaVA-Meta-Llama-3-8B-Instruct-FT	Hugging Face	Fully fine-tuned model weights in HuggingFace format.
LLaVA-Meta-Llama-3-8B-Instruct-FT-S2	Hugging Face	Fully fine-tuned S2 model weights in HuggingFace format.

Installation

git clone https://github.com/mbzuai-oryx/LLaVA-pp.git
cd LLaVA-pp
git submodule update --init --recursive

Packages you need to update relative to the LLaVA requirements:

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3
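If the LLaVA directory is still empty after cloning, the copy steps in the sections below have nothing to copy into. A minimal sketch of a check, assuming a POSIX shell; the helper name `submodule_populated` is hypothetical and not part of this repository:

```shell
# Hypothetical helper (not part of LLaVA-pp): returns 0 if the given
# directory exists and contains at least one entry, non-zero otherwise.
submodule_populated() {
    dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Example (run from the LLaVA-pp root): initialize the submodule only if needed.
#   submodule_populated LLaVA || git submodule update --init --recursive
```

The check only looks for a non-empty directory; it does not confirm that the submodule is at the expected commit.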

🚀 Phi-3-V

To integrate Phi-3-V with LLaVA, follow these steps to update the codebase:

# Copy necessary files
cp Phi-3-V/train.py LLaVA/llava/train/train.py
cp Phi-3-V/llava_phi3.py LLaVA/llava/model/language_model/llava_phi3.py
cp Phi-3-V/builder.py LLaVA/llava/model/builder.py
cp Phi-3-V/model__init__.py LLaVA/llava/model/__init__.py
cp Phi-3-V/main__init__.py LLaVA/llava/__init__.py
cp Phi-3-V/conversation.py LLaVA/llava/conversation.py

# Copy training scripts
cp scripts/Phi3-V_pretrain.sh LLaVA/Phi3-V_pretrain.sh
cp scripts/Phi3-V_finetune_lora.sh LLaVA/Phi3-V_finetune_lora.sh
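Several of the copy steps above overwrite files that already exist in the LLaVA submodule (for example train.py, builder.py and conversation.py). A minimal sketch of a copy that keeps the original file around, assuming a POSIX shell; `copy_with_backup` and the `.orig` suffix are hypothetical choices, not part of this repository:

```shell
# Hypothetical helper (not part of LLaVA-pp): copy SRC over DST, saving the
# first pre-existing DST as DST.orig so the upstream file can be restored.
copy_with_backup() {
    src="$1"
    dst="$2"
    if [ ! -f "$src" ]; then
        echo "missing source: $src" >&2
        return 1
    fi
    # Back up only once, so repeated runs never overwrite the pristine copy.
    if [ -f "$dst" ] && [ ! -f "$dst.orig" ]; then
        cp "$dst" "$dst.orig"
    fi
    cp "$src" "$dst"
}

# Example (run from the LLaVA-pp root):
#   copy_with_backup Phi-3-V/train.py LLaVA/llava/train/train.py
```

Restoring the upstream file is then a matter of copying the `.orig` file back over the destination.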

Train Phi-3-V

Pre-train

cd LLaVA
bash Phi3-V_pretrain.sh

Finetune

cd LLaVA
bash Phi3-V_finetune_lora.sh
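Training runs are long, so it can help to keep a log of each stage and to start fine-tuning only when pre-training has exited cleanly. A minimal sketch, assuming bash is available; `run_logged` and the log file names are hypothetical and not part of this repository:

```shell
# Hypothetical helper (not part of LLaVA-pp): run a script with bash, write
# its stdout and stderr to a log file, and return the script's exit status.
run_logged() {
    script="$1"
    log="$2"
    bash "$script" >"$log" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "$script failed with status $status; see $log" >&2
    fi
    return "$status"
}

# Example (run from inside LLaVA/): fine-tune only if pre-training succeeded.
#   run_logged Phi3-V_pretrain.sh pretrain.log &&
#       run_logged Phi3-V_finetune_lora.sh finetune_lora.log
```

Redirecting to a file rather than piping through `tee` keeps the script's own exit status available without relying on shell-specific options.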

🚀 LLaMA-3-V

To integrate LLaMA-3-V with LLaVA, follow these steps to update the codebase:

# Copy necessary files
cp LLaMA-3-V/train.py LLaVA/llava/train/train.py
cp LLaMA-3-V/conversation.py LLaVA/llava/conversation.py
cp LLaMA-3-V/builder.py LLaVA/llava/model/builder.py
cp LLaMA-3-V/llava_llama.py LLaVA/llava/model/language_model/llava_llama.py

# Copy training scripts
cp scripts/LLaMA3-V_pretrain.sh LLaVA/LLaMA3-V_pretrain.sh
cp scripts/LLaMA3-V_finetune_lora.sh LLaVA/LLaMA3-V_finetune_lora.sh
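The Phi-3-V and LLaMA-3-V steps both write to LLaVA/llava/train/train.py, LLaVA/llava/conversation.py and LLaVA/llava/model/builder.py, so whichever integration was copied last is the one in effect. A minimal sketch for checking which one that is, assuming a POSIX shell with `cmp`; `active_variant` is a hypothetical helper, not part of this repository, and it compares train.py only:

```shell
# Hypothetical helper (not part of LLaVA-pp): print which integration's
# train.py is currently installed in the LLaVA submodule, or "none".
# ROOT is the LLaVA-pp checkout (defaults to the current directory).
active_variant() {
    root="${1:-.}"
    target="$root/LLaVA/llava/train/train.py"
    for variant in Phi-3-V LLaMA-3-V; do
        # cmp -s is silent and exits 0 only when the files are identical.
        if cmp -s "$root/$variant/train.py" "$target" 2>/dev/null; then
            echo "$variant"
            return 0
        fi
    done
    echo "none"
    return 1
}

# Example (run from the LLaVA-pp root):
#   active_variant .
```

If the two integrations ever shipped identical train.py files, this sketch would report the first match, so treat the result as a quick hint rather than a guarantee.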

Train LLaMA-3-V

Pre-train

cd LLaVA
bash LLaMA3-V_pretrain.sh

Finetune

cd LLaVA
bash LLaMA3-V_finetune_lora.sh

🙏 Acknowledgement

We are thankful to LLaVA, lmms-eval and S²-Wrapper for releasing their models and code as open-source contributions.

If you face any issues or have any questions, please feel free to create an issue or reach out at hanoona.bangalath@mbzuai.ac.ae and muhammad.maaz@mbzuai.ac.ae.

📜 Citation

  @misc{hanoona2024LLaVA++,
          title={LLaVA++: Extending Visual Capabilities with LLaMA-3 and Phi-3},
          author={Rasheed, Hanoona and Maaz, Muhammad and Khan, Salman and Khan, Fahad S.},
          url={https://github.com/mbzuai-oryx/LLaVA-pp},
          year={2024}
  }
