JiuTian-LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024

[Paper] [Project Page] [Video(YouTube)] [Video(bilibili)]

🔥 Details will be released. Stay tuned 🍻 👍



If you find this work useful for your research, please kindly cite our paper and star our repo.

Updates

Introduction

This is the GitHub repository of LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge. In this work, we enhance MLLMs by integrating fine-grained spatial-aware visual knowledge and high-level semantic visual evidence, which boosts their capabilities and alleviates hallucinations.

The framework of the proposed LION model:

Evaluation results

For image-level tasks, we focus on image captioning and Visual Question Answering (VQA). For region-level tasks, we evaluate LION on three referring expression comprehension (REC) datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results, detailed in Tables 1 and 2, highlight LION's superior performance compared to baseline models.

[Tables: image-level and region-level results]
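As a point of reference, REC results are conventionally reported as accuracy at an IoU threshold of 0.5 between the predicted and ground-truth boxes. Below is a minimal sketch of that standard metric; the box format (x1, y1, x2, y2) and the function names are assumptions for illustration, and this is not the repository's evaluation code, which has not been released yet.

```python
# Sketch of the standard REC metric: a prediction counts as correct when its
# IoU with the ground-truth box is at least 0.5. Boxes are assumed to be
# (x1, y1, x2, y2) tuples; this is NOT the official LION evaluation code.

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of referring expressions whose predicted box matches the target."""
    correct = sum(
        box_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)
```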

We further evaluate LION on an object hallucination benchmark (POPE) and the most popular MLLM benchmark (MMBench). The results below show that LION performs strongly across a wide range of skills and also demonstrates strong resistance to hallucinations, particularly in the popular and adversarial settings of POPE.

[Tables: MMBench and POPE results]
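For context, POPE poses yes/no questions about object presence and scores the model's answers with accuracy, precision, recall, F1, and the ratio of "yes" responses. The sketch below illustrates those standard metrics under the assumption that answers and labels are "yes"/"no" strings; it is not the official POPE or LION evaluation script.

```python
# Sketch of how POPE-style yes/no answers about object presence are scored.
# Illustration only; not the official POPE or LION evaluation code.

def pope_metrics(answers, labels):
    """answers/labels are equal-length lists of 'yes'/'no' strings."""
    tp = sum(a == "yes" and l == "yes" for a, l in zip(answers, labels))
    fp = sum(a == "yes" and l == "no" for a, l in zip(answers, labels))
    tn = sum(a == "no" and l == "no" for a, l in zip(answers, labels))
    fn = sum(a == "no" and l == "yes" for a, l in zip(answers, labels))
    total = len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total,
    }
```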

Qualitative Comparison

[Figures: qualitative comparison between LION and baseline MLLMs]

More Examples

[Figure: additional qualitative examples]

Citation

If you find this work useful for your research, please kindly cite our paper:

@inproceedings{chen2024lion,
    title={LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge}, 
    author={Chen, Gongwei and Shen, Leyang and Shao, Rui and Deng, Xiang and Nie, Liqiang},
    booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}
