Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts

Project page | arXiv

Model

ViTiS consists of a frozen video encoder, a visual mapping network, a frozen text embedding layer, a frozen language model, and a frozen classifier head. Given input video frames and text, the video encoder extracts frame features, and the visual mapping network maps them to the same space as the text embeddings obtained by the text embedding layer. The language model then takes the video and text embeddings as input and predicts the masked input tokens.

The language model incorporates learnable text prompts in the key and value of multi-head attention, as well as adapter layers after each self-attention and feed-forward layer, before LayerNorm.
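
As a rough illustration of this design (a minimal sketch, not the released implementation; module names, prompt counts, and dimensions below are assumptions), learnable prompts can be prepended to the keys and values of self-attention, with a bottleneck adapter applied to the attention output before LayerNorm:

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus residual."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class PromptedSelfAttention(nn.Module):
    """Self-attention whose keys and values are prefixed with learnable text prompts."""
    def __init__(self, dim, num_heads=8, num_prompts=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # learnable prompts are prepended to keys and values only
        self.key_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.value_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.adapter = Adapter(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        b = x.size(0)
        k = torch.cat([self.key_prompts.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.value_prompts.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(query=x, key=k, value=v)
        # adapter on the attention output, before the residual connection and LayerNorm
        return self.norm(x + self.adapter(out))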

Our visual mapping network consists of a number of layers, each performing cross-attention between learnable visual prompts and video frame features followed by self-attention.
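
The following minimal sketch shows one possible realization of such a mapping network in PyTorch; the layer count, hidden size, and number of visual prompts are illustrative assumptions, not the released configuration:

import torch
import torch.nn as nn

class VisualMappingLayer(nn.Module):
    """One layer: visual prompts cross-attend to frame features, then self-attend."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, prompts, frame_feats):
        # prompts: (batch, num_prompts, dim), frame_feats: (batch, num_frames, dim)
        x, _ = self.cross_attn(query=prompts, key=frame_feats, value=frame_feats)
        prompts = self.norm1(prompts + x)
        y, _ = self.self_attn(prompts, prompts, prompts)
        return self.norm2(prompts + y)

class VisualMappingNetwork(nn.Module):
    """Stack of mapping layers turning frame features into video token embeddings."""
    def __init__(self, dim=768, num_layers=2, num_prompts=10):
        super().__init__()
        self.visual_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.layers = nn.ModuleList(VisualMappingLayer(dim) for _ in range(num_layers))

    def forward(self, frame_feats):  # frame_feats: (batch, num_frames, dim)
        prompts = self.visual_prompts.expand(frame_feats.size(0), -1, -1)
        for layer in self.layers:
            prompts = layer(prompts, frame_feats)
        return prompts  # embeddings in the text space, concatenated with text tokens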

Setup

To set up a conda environment:

conda env create -f vitis.yml 
conda activate vitis
pip install git+https://github.com/openai/CLIP.git
conda update ffmpeg
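
Optionally, run a quick sanity check in a Python shell to confirm that PyTorch and CLIP are installed correctly (ViT-L/14 matches the backbone used for feature extraction and is downloaded on first use):

import torch, clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)   # downloaded on first use
print(torch.__version__, device, model.visual.output_dim)  # 768-dim features for ViT-L/14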

Data Preparation

This repository contains both ready-to-use data and guidelines for processing raw data.

Processed Data
  • Download processed downstream datasets from this link and place them in the data folder.
    • Note that the datasets are prepared by following the instructions here, and features are extracted for each dataset.
    • Note that subtitles, vocabulary files, and data splits are obtained from here.
    • Due to storage limitations, WebVid2M features are unavailable.
Raw Data Processing Guidelines
  • Download the WebVid2M dataset and extract it into the data/WebVid folder.
  • Download the MSRVTT-QA and MSVD-QA datasets and extract them into the data/MSRVTT-QA and data/MSVD-QA folders.
    • Note that the YouTube mapping file for the MSVD-QA dataset should be downloaded from here.
  • Download the ActivityNet-QA dataset and extract it into the data/ActivityNet-QA folder.
  • Download the TGIF-FrameQA dataset and extract it into the data/TGIF-QA folder.
  • For all datasets, videos should be placed in the data/<dataset_name>/videos folder.
  • For all datasets, download the subtitles, vocabulary files, and data split csv files from this link.
Feature Extraction for downstream datasets
  • Prepare video id list for all datasets:
python extract/prepare_video_ids_for_all_datasets.py
  • Download the CLIP model to the checkpoints folder.
  • Extract video features for each dataset, where <dataset_name> is one of {msrvtt | msvd | activitynet | tgif | webvid} and <DATASET_PATH> is the corresponding path in {MSRVTT-QA | MSVD-QA | ActivityNet-QA | TGIF-QA | WEBVID}.
  • Create a features folder in data/<DATASET_PATH>, then run:
python extract/extract_video_features.py --dataset_name <dataset_name> \ 
--feature_extraction_csv data/<DATASET_PATH>/video_id_list.csv \
--feature_extraction_video_main_path data/<DATASET_PATH>/videos \
--feature_extraction_features_main_path data/<DATASET_PATH>/features
  • Merge video features for each dataset (except webvid); a quick way to inspect the merged file is sketched after these commands:
python extract/merge_features.py --dataset <dataset_name> \
--folder data/<DATASET_PATH>/features \ 
--output_path data/<DATASET_PATH>/features/clipvitl14.pth
  • Merge video features for webvid:
python extract/create_hdf5.py
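
A quick way to sanity-check a merged feature file (this assumes clipvitl14.pth stores a mapping from video id to a per-frame feature tensor; the exact layout is an assumption, and the path below is only an example):

import torch

# hypothetical example path; substitute your own <DATASET_PATH>
feats = torch.load("data/MSRVTT-QA/features/clipvitl14.pth", map_location="cpu")
video_id, feat = next(iter(feats.items()))
print(len(feats), video_id, tuple(feat.shape))  # expect roughly (num_frames, 768) per video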

Pre-training

  • Download the DeBERTa-V2 model files from here to the checkpoints/deberta-v2-xlarge folder.

  • To train ViTiS on WebVid2M, run the following command:

python -m torch.distributed.launch --nproc_per_node 8 --use_env main.py \
--combine_datasets webvid --combine_datasets_val webvid --save_dir=output_webvid --lr=2e-5 --different_lr_embedding_layers \
--batch_size=16 --batch_size_val=16 --epochs=10 --amp \
--mapping_network_feedforward --text_prompt_projection_layer

The other parameters are set to their defaults; see our paper for details. Note that pre-training is done on 8 Tesla V100 GPUs (32 GB).

Zero-shot evaluation

  • Download the pre-trained model files from here to the checkpoints folder.

  • To evaluate ViTiS in the zero-shot setting, run the following command:

python -m torch.distributed.launch --nproc_per_node 1 --use_env videoqa.py --combine_datasets <dataset_name> --combine_datasets_val <dataset_name> \
--batch_size_val=32 --amp --mapping_network_feedforward --text_prompt_projection_layer \
--<dataset_name>_vocab_path=data/<DATASET_PATH>/vocab1000.json --load checkpoints/vitis_pretraining_zero_shot.pth --eval --test

Few-shot fine-tuning

  • Download the pre-trained model file from here to the checkpoints folder.
  • Note that the zero-shot and few-shot checkpoints are taken from different epochs.
  • We choose the vocabulary that yields the best performance on the validation set.
  • Note that fine-tuning is done on 4 Tesla V100 GPUs (32 GB).

All trainable model parameters fine-tuned

  • To fine-tune all trainable parameters in the few-shot setting, run the following command:
python -m torch.distributed.launch --nproc_per_node 4 --use_env videoqa.py --combine_datasets <dataset_name> --combine_datasets_val <dataset_name> \
--save_dir=output_few_shot --lr=1e-5 --different_lr_embedding_layers \
--amp --mapping_network_feedforward --text_prompt_projection_layer \
--batch_size=8 --batch_size_val=32 --epochs=20 --<dataset_name>_vocab_path=data/<DATASET_PATH>/vocab1000.json \
--load checkpoints/vitis_pretraining_few_shot.pth
  • Note that the base learning rate is searched over 5 values in the interval [1e-5, 5e-5], while the learning rate for visual and text prompts is kept at 1e-3.

Only prompts fine-tuned

  • Download the saved prompts file from here to the checkpoints folder.
  • To fine-tune only the prompts in the few-shot setting, run the following command:
python -m torch.distributed.launch --nproc_per_node 4 --use_env videoqa.py --combine_datasets <dataset_name> --combine_datasets_val <dataset_name> \
--save_dir=output_few_shot --lr=1e-2 --amp --mapping_network_feedforward --batch_size=8 --batch_size_val=32 --epochs=20 \
--<dataset_name>_vocab_path=data/<DATASET_PATH>/vocab1000.json \
--load checkpoints/vitis_pretraining_few_shot.pth --loaded_prompts text --only_finetune_loaded_prompts visual_text
  • Note that the base learning rate is searched over 3 values in the interval [1e-2, 3e-2].

License

This code is released under the Apache License 2.0.

Acknowledgments

The code is based on FrozenBiLM.
The prompt learning code is inspired by P-tuning-v2.

Citation

If this code is helpful for you, please cite the following:

@inproceedings{engin_2023_ICCV,
    title={Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts},
    author={Engin, Deniz and Avrithis, Yannis},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    year={2023}
}