All images used in CARETS come from the GQA validation set. If you have already downloaded the GQA dataset, you can set the images_root field in the dataset config to your GQA images directory. Otherwise, you have two options: 1) download all images in the GQA dataset from here (20 GB), or 2) download just the subset of images that CARETS uses with the script below (1.3 GB).
cd CARETS
export DATADIR=data          # where to store the images directory
export TARNAME=images.tar.gz
mkdir -p $DATADIR            # make sure the target directory exists
wget --save-cookies cookies.txt 'https://drive.google.com/uc?id=1Yi_Zgbn0rraekBV96Vwmg9kOuv72b1Lt&export=download' -O- \
  | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' > confirm.txt && \
wget --load-cookies cookies.txt -O $TARNAME \
  'https://drive.google.com/uc?id=1Yi_Zgbn0rraekBV96Vwmg9kOuv72b1Lt&export=download&confirm='$(<confirm.txt) && \
tar -xzf $TARNAME -C $DATADIR
rm -f cookies.txt confirm.txt $TARNAME
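To sanity-check the extraction, you can count the unpacked image files. This assumes the archive unpacks into $DATADIR/images, as the default config's images_root suggests; the expected total is not documented here, so treat this as a smoke test rather than a checksum:

find $DATADIR/images -name '*.jpg' | wc -l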
Datasets are defined by a YAML configuration file with the following basic format:
images_root: data/images/     # *.jpg image files
files_root: data/questions/   # *.json files
tests:
  rephrasing_invariance:
    eval_type: invariance     # (invariance | directional_expectation)
    files:
      - rephrasing_file_1.json # filename located in data/questions/
      ...
  ...
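As a concrete illustration, a directional-expectation test (e.g., negation) would be declared the same way, with eval_type set accordingly. The test name and file name below are illustrative, not the actual defaults:

tests:
  negation_directional_expectation:
    eval_type: directional_expectation
    files:
      - negation_file_1.json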
The configuration for the default evaluation of the non-visual perturbations can be found under configs/default.yml.
A CaretsDataset object is a collection of tests ingested from the configuration file. Each test corresponds to a question split containing pairs of questions and evaluates a particular capability (e.g., rephrasing invariance or negation directional expectation). The CaretsDataset object can be used to iterate over the questions and their metadata, including the image id and image_path.
Note: we will soon introduce a TorchCaretsDataset that is more easily compatible with PyTorch DataLoaders.
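In the meantime, a minimal wrapper sketch can bridge a split to a DataLoader. It assumes each split is iterable over question dicts with the keys shown in the example below; the class name is ours, not part of the CARETS API:

from torch.utils.data import Dataset, DataLoader

class CaretsTorchSplit(Dataset):
    """Minimal sketch: wrap one CARETS split for use with a DataLoader."""
    def __init__(self, split):
        self.questions = list(split)  # materialize the iterable split

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        q = self.questions[idx]
        return q['question_id'], q['image_path'], q['sent']

# Usage: with the default collation, each batch is a list of ids,
# paths, and question strings.
# loader = DataLoader(CaretsTorchSplit(split), batch_size=32)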
import random
from carets import CaretsDataset

dataset = CaretsDataset('./configs/default.yml')

# Collect a prediction for every question id (a random baseline is shown here).
predictions = dict()
for test_name, split in dataset.splits:
    for question in split:
        question_id = question['question_id']
        img_path = question['image_path']      # available metadata,
        question_text = question['sent']       # unused by the random baseline
        predictions[question_id] = random.choice(['cat', 'yes', 'no', 'red'])

# Score the collected predictions on each split.
for test_name, split in dataset.splits:
    accuracy = split.total_accuracy(predictions)
    consistency = split.evaluate(predictions)
    comprehensive_accuracy = split.comprehensive_accuracy(predictions)
    eval_type = split.eval_type
    print(f'{test_name.ljust(24)}: accuracy: {accuracy:.3f}, {eval_type.ljust(24)}:' + \
          f' {consistency:.3f}, comprehensive_accuracy: {comprehensive_accuracy:.3f}')
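To evaluate a real VQA model instead of the random baseline, fill predictions with the model's answers. The sketch below reuses dataset and predictions from the example above; predict is a hypothetical stand-in for your model's inference call, not part of CARETS:

from PIL import Image

def predict(image, question_text):
    # Hypothetical stand-in: replace with your model's actual inference call.
    return 'yes'

for test_name, split in dataset.splits:
    for question in split:
        image = Image.open(question['image_path']).convert('RGB')
        predictions[question['question_id']] = predict(image, question['sent'])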
Coming soon...
@inproceedings{jimenez2022carets,
  title={CARETS: A Consistency And Robustness Evaluative Test Suite for VQA},
  author={Carlos E. Jimenez and Olga Russakovsky and Karthik Narasimhan},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2022}
}