Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

CVPR 2024

Haoning Wu¹^*, Zicheng Zhang²^*, Erli Zhang¹^*, Chaofeng Chen¹, Liang Liao¹, Annan Wang¹, Kaixin Xu⁴,

Chunyi Li², Jingwen Hou¹, Guangtao Zhai², Geng Xue⁴, Wenxiu Sun³, Qiong Yan³, Weisi Lin¹^#

¹Nanyang Technological University, ²Shanghai Jiaotong University, ³Sensetime Research, ⁴I2R@A*STAR

^*Equal contribution. ^#Corresponding author.

Dataset | Model Zoo | Paper (Preview) | Demo (Hugging Face)

Build Local Demos

You can now run the Q-Instruct demos on your own device!

See local demos for instructions. (Currently, only mplug_owl-2 is supported.)

Quicker Start: the Scorer API

Install dependencies:

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl2/ 
pip install -e .

Create a scorer:

from boost_qa.model import QInstructScorer as Scorer
scorer = Scorer(boost=False)

The scorer takes a PIL Image (or a list of PIL Images) as input:

from PIL import Image
images = [Image.open("fig/sausage.jpg"), Image.open("fig/examples_211.jpg")]
print(scorer(images).tolist())

The output should be [0.429931640625, 0.204833984375].
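Once scores are available, a common next step is to order images by them. The sketch below is illustrative only: `rank_by_score` is a hypothetical helper, not part of `boost_qa`, and it assumes that a higher score means better perceived quality. It works on plain lists, so it runs without the scorer installed; the numbers are the two values quoted above.

```python
def rank_by_score(paths, scores):
    """Return (path, score) pairs sorted from highest to lowest score.

    Assumption: a higher score means better perceived quality.
    """
    if len(paths) != len(scores):
        raise ValueError("paths and scores must have the same length")
    return sorted(zip(paths, scores), key=lambda pair: pair[1], reverse=True)


paths = ["fig/sausage.jpg", "fig/examples_211.jpg"]
scores = [0.429931640625, 0.204833984375]  # e.g. scorer(images).tolist()

for path, score in rank_by_score(paths, scores):
    print(f"{score:.4f}  {path}")
```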

Quick Start

If your server has a poor connection to Hugging Face (for example, for users in mainland China), we also provide an alternative way to download the weights from ModelScope. See the Download Weights from ModelScope guide for details.

LLaVA-v1.5

Install LLaVA.

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .

Simple Interactive Demos.

See the code and scripts below.

Example Code (Single Query)

from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
model_path = "teowu/llava_v1.5_7b_qinstruct_preview_v0.1" 
prompt = "Rate the quality of the image. Think step by step."
image_file = "fig/sausage.jpg"
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
})()
eval_model(args)
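The `type('Args', (), {...})()` idiom above simply builds an object whose attributes hold the settings. A minimal sketch of the same idea using the standard library's `types.SimpleNamespace` is shown here. The helper name `build_eval_args` is hypothetical, and `model_name` is derived with a plain string split as a stand-in so the snippet runs without LLaVA installed; in practice it would come from `get_model_name_from_path`. Depending on the installed LLaVA version, `eval_model` may expect further attributes, so treat this as an illustration rather than the repository's code.

```python
from types import SimpleNamespace


def build_eval_args(model_path, query, image_file):
    """Bundle the settings from the example above into one attribute object."""
    return SimpleNamespace(
        model_path=model_path,
        model_base=None,
        # Stand-in for get_model_name_from_path(model_path) (assumption).
        model_name=model_path.split("/")[-1],
        query=query,
        conv_mode=None,
        image_file=image_file,
        sep=",",
    )


args = build_eval_args(
    "teowu/llava_v1.5_7b_qinstruct_preview_v0.1",
    "Rate the quality of the image. Think step by step.",
    "fig/sausage.jpg",
)
print(args.model_name, args.image_file)
```

With LLaVA installed, the resulting object could be passed to `eval_model(args)` in place of the `type(...)` construction.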

Example Code (CLI Demo for Multi-turn Conversation)

python -m llava.serve.cli \
    --model-path teowu/llava_v1.5_7b_qinstruct_preview_v0.1 \
    --image-file "fig/sausage.jpg"

Note: The results may contain randomness as do_sample=True is enabled during conversation mode.
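To see why sampling makes outputs vary, the toy sketch below contrasts greedy selection with sampling over a made-up next-token distribution. It is a pure-Python illustration of the general idea, not LLaVA's decoding code; `pick_token` and `toy_probs` are hypothetical names.

```python
import random


def pick_token(probs, do_sample, rng):
    """Pick the next token from a {token: probability} mapping."""
    tokens = list(probs)
    if do_sample:
        # Sampling: draw in proportion to probability, so runs can differ.
        weights = [probs[t] for t in tokens]
        return rng.choices(tokens, weights=weights, k=1)[0]
    # Greedy: always take the most likely token, so every run agrees.
    return max(tokens, key=probs.get)


toy_probs = {"good": 0.5, "fair": 0.3, "poor": 0.2}  # made-up values

greedy = [pick_token(toy_probs, False, random.Random(seed)) for seed in range(5)]
sampled = [pick_token(toy_probs, True, random.Random(seed)) for seed in range(5)]
print("greedy :", greedy)
print("sampled:", sampled)
```

Fixing the random seed makes a sampled run repeatable, which is the usual way to get comparable outputs while sampling stays enabled.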

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench.

python eval_scripts/llava_v1.5/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/llava_v1.5/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/llava_v1.5/eval_video_quality.py
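Quality-assessment predictions are commonly compared with human opinion scores using Spearman rank correlation (SRCC) and Pearson linear correlation (PLCC). The sketch below shows both in pure Python. It is not the repository's evaluation script, the function names are hypothetical, and the sample numbers are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [0.43, 0.20, 0.71, 0.55]  # made-up model scores
opinion = [3.1, 1.8, 4.5, 3.0]        # made-up human opinion scores
print(f"SRCC={spearman(predicted, opinion):.3f}  PLCC={pearson(predicted, opinion):.3f}")
```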

mPLUG-Owl-2

For mPLUG-Owl-2, only single-GPU inference is currently supported. Please set the environment variable (e.g. export CUDA_VISIBLE_DEVICES=0) to make sure that the model is loaded on only one device.
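When launching from a Python script or notebook rather than a shell, the same restriction can be applied from within Python. A minimal sketch follows; `restrict_to_single_gpu` is a hypothetical helper, and it only has an effect if it runs before any CUDA library (such as torch) initializes the GPU runtime.

```python
import os


def restrict_to_single_gpu(index: int = 0) -> str:
    """Expose only one GPU to libraries that are initialized afterwards."""
    value = str(index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this first, before importing torch or loading the model.
restrict_to_single_gpu(0)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```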

Install mPLUG-Owl-2.

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl2/ 
pip install -e .

Simple Interactive Demos

Example Code (Single Query)

from mplug_owl2.mm_utils import get_model_name_from_path
from eval_scripts.mplug_owl_2.run_mplug_owl2 import eval_model
model_path = "teowu/mplug_owl2_7b_448_qinstruct_preview_v0.2" 
prompt = "Rate the quality of the image. Think step by step."
image_file = "fig/sausage.jpg"
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
})()
eval_model(args)

Example Code (CLI Demo for Multi-turn Conversation)

python -m mplug_owl2.serve.cli \
    --model-path teowu/mplug_owl2_7b_448_qinstruct_preview_v0.1 \
    --image-file "fig/sausage.jpg"

Note: The results may contain randomness as do_sample=True is enabled during conversation mode.

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench.

python eval_scripts/mplug_owl_2/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/mplug_owl_2/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/mplug_owl_2/eval_video_quality.py

InternLM-XComposer-VL

InternLM-XComposer-VL has been integrated into Hugging Face AutoModel (remote code mode), so you can start directly with the code below without a separate installation step.

Simple Interactive Demos

Example Code (Single Query)

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True)
model.tokenizer = tokenizer

# Single-Turn Text-Image Dialogue
text = 'Describe and evaluate the quality of the image.'
image = 'fig/sausage.jpg'
response = model.generate(text, image)
print(response)

Example Code (Multi-Turn Conversation)

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True)
model.tokenizer = tokenizer

# Multi-Turn Dialogue
text = 'Describe and evaluate the quality of the image.'
image = 'fig/sausage.jpg'
response, history = model.chat(text, image, history=None)
print(f'User: {text}')
print(f'Bot: {response}')

text = 'Which part of the pan is clearer, the top part or the bottom part?'
response, history = model.chat(text=text, image=None, history=history)
print(f'User: {text}')
print(f'Bot: {response}')
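The two turns above follow one pattern: the image is sent with the first question, and later questions pass only the returned `history`. That pattern can be sketched as a loop. `run_dialogue` is a hypothetical helper that accepts any callable with the `chat(text, image, history)` shape used above, and `fake_chat` is a stand-in so the snippet runs without loading the model.

```python
def run_dialogue(chat_fn, questions, image):
    """Ask the questions in order, threading the history through each turn."""
    history = None
    transcript = []
    for turn, text in enumerate(questions):
        # Only the first turn carries the image; later turns rely on history.
        turn_image = image if turn == 0 else None
        response, history = chat_fn(text=text, image=turn_image, history=history)
        transcript.append((text, response))
    return transcript


def fake_chat(text, image, history):
    """Stand-in for model.chat, used only to make this sketch runnable."""
    history = (history or []) + [text]
    has_image = "yes" if image else "no"
    return f"reply {len(history)} (image={has_image})", history


questions = [
    "Describe and evaluate the quality of the image.",
    "Which part of the pan is clearer, the top part or the bottom part?",
]
for text, response in run_dialogue(fake_chat, questions, "fig/sausage.jpg"):
    print(f"User: {text}")
    print(f"Bot: {response}")
```

With the model loaded as above, `run_dialogue(model.chat, questions, 'fig/sausage.jpg')` would be the corresponding call.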

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench.

python eval_scripts/internlm_xcomposer_vl/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/internlm_xcomposer_vl/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/internlm_xcomposer_vl/eval_video_quality.py

Model Zoo

See Model Zoo. Both Hugging Face and ModelScope weights are provided.

Training

License

Researchers and open-source developers are free to use the Q-Instruct dataset and the fine-tuned weights as provided for the four MLLMs. Commercial use is also allowed, but it must be approved by our team in advance. Please email haoning001@e.ntu.edu.sg to request permission for commercial use.

Citation

If you consider this work interesting, please feel free to cite it in your work!

@misc{wu2023qinstruct,
      title={Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models}, 
      author={Haoning Wu and Zicheng Zhang and Erli Zhang and Chaofeng Chen and Liang Liao and Annan Wang and Kaixin Xu and Chunyi Li and Jingwen Hou and Guangtao Zhai and Geng Xue and Wenxiu Sun and Qiong Yan and Weisi Lin},
      year={2023},
      eprint={2311.06783},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

The results are reported on Q-Bench, whose BibTeX entry is provided as follows:

@misc{wu2023qbench,
      title={Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision}, 
      author={Haoning Wu and Zicheng Zhang and Erli Zhang and Chaofeng Chen and Liang Liao and Annan Wang and Chunyi Li and Wenxiu Sun and Qiong Yan and Guangtao Zhai and Weisi Lin},
      year={2023},
      eprint={2309.14181},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
