
Visual Question Answering with ViLT

ViLT = Vision-and-Language Transformer

The ViLT model was proposed in ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT incorporates text embeddings directly into a Vision Transformer (ViT), giving it a minimal design for Vision-and-Language Pre-training (VLP): no convolutional backbone and no region supervision.

https://huggingface.co/docs/transformers/model_doc/vilt
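Below is a minimal sketch of visual question answering with ViLT through the Hugging Face transformers library, following the pattern from the documentation linked above. It uses the publicly available dandelin/vilt-b32-finetuned-vqa checkpoint and a sample COCO image as illustrative choices; the notebook in this repository may use different inputs.

```python
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import requests

# Sample image (COCO) and question -- illustrative inputs only
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# Load the processor and a VQA-finetuned ViLT checkpoint
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode the image-question pair and run a forward pass
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)

# The model classifies over a fixed answer vocabulary
idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```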

Notebook

A Python notebook demo is provided in this repository.

20-Jan-2023 Serge Retkowsky | serge.retkowsky@microsoft.com | https://www.linkedin.com/in/serger/