EVA-02


We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling.

With an updated plain Transformer architecture and extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across representative vision tasks, while using significantly fewer parameters and less compute.

Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, EVA-02-CLIP reaches up to 80.4 zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data.

We offer four EVA-02 variants, ranging from 6M to 304M parameters, all with impressive performance.

We hope our efforts enable a broader research community to advance the field in a more efficient, affordable, and equitable manner.

Summary of EVA-02 performance

(See the `summary_tab` figure in the repository for a full performance overview.)

Get Started

Best Practice

  • If you would like to use or fine-tune EVA-02 in your project, start with a shorter training schedule and a smaller learning rate than the baseline setting.
  • Using EVA-02 as a feature extractor: see #56 and the sketch below.
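
As a rough illustration of the feature-extractor use case, the sketch below loads an EVA-02 checkpoint through timm (one of the acknowledged libraries) and drops the classification head so the model returns pooled features. The specific checkpoint name is an assumption; list the ones your timm version ships with `timm.list_models('eva02*')`.

```python
# Minimal sketch (not the official recipe): use an EVA-02 backbone from timm
# as a frozen feature extractor. The checkpoint name below is an assumption;
# check timm.list_models('eva02*') for the models available in your timm version.
import timm
import torch

model = timm.create_model(
    "eva02_base_patch14_224.mim_in22k",  # hypothetical choice of EVA-02 checkpoint
    pretrained=True,
    num_classes=0,                       # num_classes=0 -> drop the head, return pooled features
)
model.eval()

with torch.no_grad():
    images = torch.randn(1, 3, 224, 224)  # one RGB image at the model's input resolution
    features = model(images)              # shape: (1, feature_dim)

print(features.shape)
```

If you fine-tune instead, the same `create_model` call with a task-specific `num_classes` is a reasonable starting point, combined with the shorter schedule and smaller learning rate suggested above.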

BibTeX & Citation

@article{EVA02,
  title={EVA-02: A Visual Representation for Neon Genesis},
  author={Fang, Yuxin and Sun, Quan and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2303.11331},
  year={2023}
}

Acknowledgement

EVA-02 is built upon EVA-01, BEiT, BEiTv2, CLIP, MAE, timm, DeepSpeed, Apex, xFormers, detectron2, mmcv, mmdet, mmseg, ViT-Adapter, detrex, and rotary-embedding-torch.

Contact

  • For help, issues, or bug reports associated with EVA-02, please open a GitHub Issue with the label EVA-02. Let's build a better & stronger EVA-02 together :)

  • We are hiring at all levels in the BAAI Vision Team, including full-time researchers, engineers, and interns. If you are interested in working with us on foundation models, self-supervised learning, and multimodal learning, please contact Yue Cao (caoyue@baai.ac.cn) and Xinlong Wang (wangxinlong@baai.ac.cn).