June 2023

tl;dr: A visually-conditioned autoregressive text generation model. It takes interleaved text tokens and images/videos as input and produces text as output.

Overall impression

Flamingo teaches an LLM how to "see". A frozen vision encoder and a frozen LLM decoder are used, with only the adapter layers trained. It augments the pretrained language model with a mechanism to directly attend to a single context image (each text token attends to the most recent preceding image). The visual conditioning is done via this adapter mechanism. Flamingo promotes a modular design of AGI.

Strong few-shot performance can be achieved on image and video understanding tasks such as classification, captioning, and question answering: these can all be cast as text prediction problems with visual input conditioning. Note that these vision-language tasks have language as the natural form of output. For vision-centric tasks such as object detection, see models such as pix2seq, pix2seq v2 and VisionLLM.

The challenge is to feed the LLM a multimodal prompt in which images/videos are interleaved with text.
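As a concrete (hedged) illustration, a few-shot captioning prompt might be assembled roughly as follows. The `<image>` and `<EOC>` (end of chunk) special tokens follow the Flamingo paper; the example captions and the exact chunking here are only illustrative.

```python
# "<image>" marks where a visual input is conditioned on; "<EOC>" closes a chunk.
# The captions are made up for illustration.
few_shot_prompt = (
    "<image> Output: a cat sitting on a sofa. <EOC> "
    "<image> Output: a dog catching a frisbee. <EOC> "
    "<image> Output:"  # the model completes this last chunk as free-form text
)
print(few_shot_prompt)
```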

Key ideas

  • The vision backbone and the LLM decoder are frozen.
  • A Perceiver Resampler is added to map a variable number of visual features to a fixed number of visual tokens, so the model can benefit from larger image sizes in a scalable way.
  • Gated cross-attention dense blocks are inserted between the original, frozen LLM layers and trained from scratch. A tanh gate initialized at zero ensures that the model generates the same results as the original LLM at initialization (see the sketch after this list).
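A minimal PyTorch sketch of the gated cross-attention dense block is shown below. Layer sizes, the feed-forward shape, and other details are illustrative assumptions, not the paper's configuration; the point being illustrated is the tanh gating initialized at zero.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of a Flamingo-style gated cross-attention dense block.

    Dimensions and head counts are illustrative, not the paper's settings.
    """

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Text tokens (queries) cross-attend to visual tokens (keys/values).
        self.xattn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Gates start at 0, so tanh(gate) = 0 and the block is an identity at
        # initialization: the frozen LLM's outputs are unchanged.
        self.xattn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, n_text, dim); visual_tokens: (batch, n_visual, dim)
        attn_out, _ = self.xattn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.xattn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # fed into the next (frozen) LLM layer
```

Because both gates start at zero, inserting these blocks does not perturb the frozen LLM, and training can gradually blend in the visual information.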

Technical details

  • The model can take in high-resolution images and videos, as it uses a Perceiver structure that produces a small, fixed number of visual tokens per image/video (roughly sketched below).
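A rough sketch of the resampling idea, under assumed sizes: a fixed set of learned latent queries cross-attends to however many visual features the frozen backbone produces, returning a fixed number of visual tokens. The actual Perceiver Resampler also concatenates the latents to the keys/values and interleaves feed-forward layers, which is omitted here.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Simplified Perceiver-style resampler; sizes and depth are illustrative."""

    def __init__(self, dim: int = 512, n_latents: int = 64, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        # A fixed set of learned latent queries; their count determines how many
        # visual tokens are handed to the LLM, no matter how many features the
        # vision backbone emits for a high-resolution image or a video.
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)]
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, n_features, dim), where n_features varies
        # with image resolution / number of video frames.
        batch = visual_features.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for attn in self.layers:
            out, _ = attn(latents, visual_features, visual_features)
            latents = latents + out
        return latents  # (batch, n_latents, dim): a fixed-size set of visual tokens
```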

Notes

  • Questions and notes on how to improve/revise the current work