
New ViT findings via registers (2309.16588) #184

Open
Infinitay opened this issue Oct 1, 2023 · 0 comments
Infinitay commented Oct 1, 2023

There was a paper released very recently by Facebook (now Meta) and INRIA reporting an improvement when they added registers to ViT. I'm not too familiar with the space, so I won't pretend to understand it, but I will leave you with the abstract:

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.

Source: Vision Transformers Need Registers

Would BLIP be able to benefit from this new technique?
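For reference, the mechanism the abstract describes is simple to sketch in PyTorch. This is an illustrative toy, not the paper's actual implementation (class name, dimensions, and the plain `nn.TransformerEncoder` backbone are all my assumptions): learnable register tokens are appended to the patch sequence, take part in attention, and are discarded before the output is used.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Toy sketch of "registers": extra learnable tokens appended to the
    ViT input sequence. They join attention as scratch space and are
    dropped from the output, so downstream heads never see them.
    (Hypothetical module, not the paper's code.)"""

    def __init__(self, dim=64, num_patches=16, num_registers=4, depth=2, heads=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # The registers: learnable tokens shared across all images.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):  # patch_tokens: (batch, num_patches, dim)
        b = patch_tokens.shape[0]
        x = torch.cat([self.cls_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = x + self.pos_embed
        # Registers get no positional embedding; they just join the sequence.
        x = torch.cat([x, self.registers.expand(b, -1, -1)], dim=1)
        x = self.encoder(x)
        # Drop the register outputs: keep only [CLS] + patch tokens.
        return x[:, : -self.num_registers]

model = ViTWithRegisters()
out = model(torch.randn(2, 16, 64))
print(out.shape)  # (2, 17, 64): [CLS] + 16 patches, registers removed
```

If I understand the paper correctly, the appeal is that this is a drop-in change to the input sequence, which is why I'm wondering whether BLIP's ViT encoder could adopt it.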
