
CLIP text-image interpretability

Visual Interpretability / XAI Tool for CLIP ViT (Vision Transformer) models


Credits & Prerequisites

Built on OpenAI/CLIP and hila-chefer/Transformer-MM-Explainability; both are required (see Setup below).

Overview

In simple terms: this tool feeds an image to a CLIP ViT vision transformer and uses gradient ascent to obtain a "CLIP opinion": words (text tokens) about the image. It then uses each [token] + [image] pair to visualize what CLIP is "looking at" (attention visualization), producing a heatmap overlay image. A minimal sketch of the gradient-ascent step follows.
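The snippet below is a minimal sketch of the gradient-ascent idea, assuming OpenAI's CLIP package. It is not the actual clipga.py implementation; the file name, step count, and learning rate are illustrative. It optimizes soft token embeddings to maximize similarity with the image, then snaps them to the nearest vocabulary tokens to get readable words.

```python
# Minimal gradient-ascent sketch (NOT the actual clipga.py code).
import torch
import clip
from clip.simple_tokenizer import SimpleTokenizer
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # fp32 avoids fp16 issues when backpropagating

image = preprocess(Image.open("input.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_f = model.encode_image(image)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)

def encode_soft(model, emb):
    # Mirrors CLIP's encode_text(), but starts from soft embeddings instead
    # of a hard token lookup, and pools at the last sequence position.
    x = emb + model.positional_embedding
    x = x.permute(1, 0, 2)  # (batch, seq, dim) -> (seq, batch, dim)
    x = model.transformer(x)
    x = x.permute(1, 0, 2)
    x = model.ln_final(x)
    return x[:, -1, :] @ model.text_projection

# Start from small random embeddings for the full 77-token context.
dim = model.token_embedding.weight.shape[1]
emb = (torch.randn(1, 77, dim, device=device) * 0.01).requires_grad_(True)
opt = torch.optim.Adam([emb], lr=0.1)

for _ in range(300):
    txt_f = encode_soft(model, emb)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    loss = -(txt_f @ img_f.T).mean()  # maximize image-text similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Snap each optimized embedding to its nearest vocabulary token and decode.
with torch.no_grad():
    ids = (emb[0] @ model.token_embedding.weight.T).argmax(dim=-1)
print(SimpleTokenizer().decode(ids.tolist()))
```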

Setup

  1. Install OpenAI/CLIP and hila-chefer/Transformer-MM-Explainability
  2. Put the contents of this repo into the "/Transformer-MM-Explainability" folder
  3. Execute "python runall.py" from the command line and follow the instructions
  4. Alternatively, run the individual scripts separately; check runall.py for details
  5. The prerequisite installs (1.) cover most requirements, except possibly kornia ("pip install kornia")
  6. Requires at least 4 GB of VRAM (for CLIP ViT-B/32). If you get a CUDA out-of-memory error, or want to use a different model, adjust the batch size in clipga.py
  7. Use the same CLIP ViT model for clipga.py and runexplain.py (set via "clipmodel=" at the top of each script)... or experiment around! See the configuration sketch after this list.
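As a rough illustration of steps 6 and 7: both scripts expose the model choice and batch size near the top of the file. This is a hedged sketch; the exact variable names in clipga.py and runexplain.py may differ.

```python
# Illustrative configuration, as referenced in steps 6-7; exact variable
# names in clipga.py / runexplain.py may differ.
import clip

print(clip.available_models())  # lists valid model names, e.g. "ViT-B/32"

clipmodel = "ViT-B/32"  # must be the same in clipga.py and runexplain.py
batch_size = 8          # lower this if you hit a CUDA out-of-memory error

model, preprocess = clip.load(clipmodel)
```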

What does a vision transformer "see"?

  • Find out where CLIP's attention falls for a given image; explore bias as well as the sophistication and breadth of concepts the AI has learned
  • Use CLIP's "opinion" plus the heatmap image for verification, then try prompting your favorite text-to-image AI with those tokens. Yes, even the "crazy tokens": after all, it is a CLIP model steering the image toward your prompt inside a text-to-image system! A simple saliency sketch follows this list.
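For intuition, here is a generic input-gradient saliency sketch: it backpropagates the image-text similarity to the pixels and overlays the result. This is not the Chefer et al. relevance method that runexplain.py builds on, and the probe token and file names are illustrative.

```python
# Generic input-gradient saliency sketch (NOT the repo's relevance method).
import torch
import clip
import matplotlib.pyplot as plt
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()

image = preprocess(Image.open("input.jpg")).unsqueeze(0).to(device)
image.requires_grad_(True)
text = clip.tokenize(["thunderbird"]).to(device)  # a token from the "CLIP opinion"

# Backprop the image-text similarity to the input pixels.
sim = torch.nn.functional.cosine_similarity(
    model.encode_image(image), model.encode_text(text)
).sum()
sim.backward()

# Collapse channel gradients into one 2D map, normalized to [0, 1].
heat = image.grad.abs().sum(dim=1)[0]
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

# Undo CLIP's input normalization for display, then overlay the heatmap.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073])
std = torch.tensor([0.26862954, 0.26130258, 0.27577711])
base = (image.detach()[0].cpu().permute(1, 2, 0) * std + mean).clamp(0, 1)
plt.imshow(base)
plt.imshow(heat.cpu(), cmap="jet", alpha=0.5)
plt.axis("off")
plt.savefig("heatmap_overlay.png")
```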

Examples:

(Example images in the repository: what-clip-sees, attention-guided, interoperthunderbirds)

