SUTS-for-CVLMs

This repository contains a spatial understanding (SU) test suite (TS) for vision-language models (VLMs), in fulfillment of the project option for the CLMS degree from the University of Washington. Read the paper at docs/Final Paper.pdf.

The test suite consists of pairs of sentences, one true and one false, each offered as a caption for an image. The goal of the VLM is to identify the true caption. By varying the sentence structure, I can test which linguistic features affect the performance of a VLM.

The two steps to creating this data set were:

1. Synthetic Image Generation: I used Unity to generate images (not included in this repo).
2. Sentence Generation: From the spatial relation metadata associated with the images made in the previous step, I used Python to generate a set of true and false sentences.
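
To illustrate the sentence generation step, here is a minimal sketch of how a true/false sentence pair could be built from one spatial relation. It is not the code in src: the metadata format (a dict with `subject`, `relation`, and `object` keys), the `OPPOSITES` table, and the function names are all assumptions made for this example, and swapping in the opposite relation is only one possible way to produce a false sentence.

```python
# Hypothetical metadata format; the real Unity export may look different.
OPPOSITES = {
    "to the left of": "to the right of",
    "to the right of": "to the left of",
    "above": "below",
    "below": "above",
}


def make_sentence(subject, relation, obj):
    """Render one (subject, relation, object) triple as an English sentence."""
    return f"The {subject} is {relation} the {obj}."


def make_pair(record):
    """Return (true_sentence, false_sentence) for one metadata record.

    The false sentence reuses the same objects with the opposite relation.
    This is an assumed strategy, not necessarily the one used in the repo.
    """
    subject = record["subject"]
    relation = record["relation"]
    obj = record["object"]
    true_sentence = make_sentence(subject, relation, obj)
    false_sentence = make_sentence(subject, OPPOSITES[relation], obj)
    return true_sentence, false_sentence


record = {"subject": "cube", "relation": "to the left of", "object": "sphere"}
print(make_pair(record))
```

Changing the template in `make_sentence` is where different sentence structures would be introduced, while the underlying relation stays the same.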

I tested CLIP's performance on this test suite, and it performed very poorly: at or below the level of random guessing.
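
As a sketch of what "at or worse than random guessing" means for a paired test suite, the snippet below computes pairwise accuracy: the fraction of images for which a model scores the true sentence higher than the false one. The `score` callable stands in for a model such as CLIP and is an assumption of this example, as are the tuple layout and the placeholder file names. It is not the evaluation code in eval. A scorer that returns random numbers lands near 0.5, which is the chance baseline.

```python
import random


def pairwise_accuracy(examples, score):
    """Fraction of examples where the true sentence outscores the false one.

    `examples` is a list of (image, true_sentence, false_sentence) tuples and
    `score(image, sentence)` returns a similarity value. Both are assumed
    interfaces for this sketch. Ties count as errors.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        1
        for image, true_sentence, false_sentence in examples
        if score(image, true_sentence) > score(image, false_sentence)
    )
    return correct / len(examples)


# A stand-in scorer that ignores its inputs, used to show the chance baseline.
rng = random.Random(0)


def random_scorer(image, sentence):
    return rng.random()


# Placeholder data: the file names and sentences are hypothetical.
examples = [
    (f"img_{i}.png", "true caption", "false caption") for i in range(1000)
]
print(pairwise_accuracy(examples, random_scorer))  # close to 0.5
```

With two candidate sentences per image, an accuracy near 0.5 means the model is not using the spatial information in the sentences.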

More information about this test suite can be found in the wiki.
